Advantage-Guided Diffusion for Model-Based Reinforcement Learning
Pith reviewed 2026-05-10 17:16 UTC · model grok-4.3
The pith
Advantage estimates guide diffusion models to sample higher-value trajectories and improve policies in model-based RL.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce Advantage-Guided Diffusion for MBRL (AGD-MBRL) and develop Sigmoid Advantage Guidance (SAG) and Exponential Advantage Guidance (EAG). We prove that guiding a diffusion model through SAG or EAG permits reweighted sampling of trajectories with weights that increase in state-action advantage, implying policy improvement under standard assumptions. We further show that trajectories generated from AGD-MBRL follow an improved policy with higher value than those from an unguided diffusion model. AGD integrates with PolyGRAD-style models by guiding only state components while leaving actions policy-conditioned and requires no change to the diffusion training objective.
What carries the argument
Advantage-Guided Diffusion (AGD) via SAG or EAG, which steers the reverse diffusion sampling toward trajectories with higher state-action advantages while keeping action generation policy-conditioned.
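A minimal sketch of what such guidance could look like in code, assuming classifier-guidance-style steering of a Gaussian reverse step as in Diffuser/PolyGRAD-type models; `advantage_grad`, `guidance_scale`, and the policy interface are illustrative placeholders, not the paper's API:

```python
import numpy as np

def guided_reverse_step(mu, sigma, advantage_grad, policy, guidance_scale=1.0):
    """One reverse-diffusion step with advantage guidance on states only.

    mu, sigma:       mean/std of the unguided denoising step for the states.
    advantage_grad:  gradient of the log advantage-weight w.r.t. the states
                     (e.g. from SAG or EAG; a hypothetical callable here).
    policy:          maps states to sampled actions; actions stay unguided.
    """
    # Shift the denoising mean along the advantage gradient, scaled by the
    # step variance -- the standard classifier-guidance update, applied to
    # the state components only.
    mu_guided = mu + guidance_scale * (sigma ** 2) * advantage_grad(mu)
    next_states = mu_guided + sigma * np.random.randn(*np.shape(mu))
    # Actions are drawn policy-conditioned on the (partially denoised) states.
    actions = policy(next_states)
    return next_states, actions
```

The design point this sketch illustrates is that the guidance term touches only the state trajectory; the policy's conditional action distribution is left untouched, which is also the source of the referee's joint-versus-marginal objection below.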
If this is right
- Trajectories sampled under SAG or EAG follow policies with strictly higher value than unguided diffusion trajectories.
- The reweighted sampling concentrates probability mass on actions whose advantages are positive, directly supporting policy improvement.
- AGD-MBRL achieves higher sample efficiency and final returns than PolyGRAD, reward-guided diffusion, and model-free methods on MuJoCo tasks.
- No modification to the diffusion training objective is needed, so existing models can adopt the guidance at inference time.
Where Pith is reading between the lines
- The same advantage-steering idea could be tested on other generative world models such as flow-matching or autoregressive transformers to check whether the improvement guarantee generalizes.
- In sparse-reward or long-horizon settings the method might allow shorter diffusion windows without loss of planning quality.
- If advantage estimates contain systematic bias the reweighting may concentrate on locally attractive but globally suboptimal trajectories.
Load-bearing premise
Advantage estimates computed from the current policy are accurate and the diffusion model is trained well enough for the guidance to shift sampling toward genuinely higher-value trajectories.
What would settle it
An experiment showing that AGD-generated trajectories produce no increase in average return or policy value compared to unguided diffusion when advantage estimates are held fixed and accurate.
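One way to operationalize that test, sketched with a caller-supplied `rollout` closure so nothing depends on the paper's unavailable code; the protocol is our reading of the falsifier, not an experiment from the manuscript:

```python
import numpy as np
from scipy import stats

def compare_guidance(rollout, n_rollouts=100):
    """rollout(guided: bool) -> float returns the total return of one
    model-generated trajectory; the advantage estimator is frozen (not
    updated) for the duration of the comparison."""
    unguided = np.array([rollout(guided=False) for _ in range(n_rollouts)])
    guided = np.array([rollout(guided=True) for _ in range(n_rollouts)])
    # Welch's t-test: with accurate, fixed advantages, a null result here
    # would falsify the claimed improvement mechanism.
    t_stat, p_value = stats.ttest_ind(guided, unguided, equal_var=False)
    return guided.mean() - unguided.mean(), p_value
```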
Original abstract
Model-based reinforcement learning (MBRL) with autoregressive world models suffers from compounding errors, whereas diffusion world models mitigate this by generating trajectory segments jointly. However, existing diffusion guides are either policy-only, discarding value information, or reward-based, which becomes myopic when the diffusion horizon is short. We introduce Advantage-Guided Diffusion for MBRL (AGD-MBRL), which steers the reverse diffusion process using the agent's advantage estimates so that sampling concentrates on trajectories expected to yield higher long-term return beyond the generated window. We develop two guides: (i) Sigmoid Advantage Guidance (SAG) and (ii) Exponential Advantage Guidance (EAG). We prove that a diffusion model guided through SAG or EAG allows us to perform reweighted sampling of trajectories with weights increasing in state-action advantage, implying policy improvement under standard assumptions. Additionally, we show that the trajectories generated from AGD-MBRL follow an improved policy (that is, with higher value) compared to an unguided diffusion model. AGD integrates seamlessly with PolyGRAD-style architectures by guiding the state components while leaving action generation policy-conditioned, and requires no change to the diffusion training objective. On MuJoCo control tasks (HalfCheetah, Hopper, Walker2D and Reacher), AGD-MBRL improves sample efficiency and final return over PolyGRAD, an online Diffuser-style reward guide, and model-free baselines (PPO/TRPO), in some cases by a margin of 2x. These results show that advantage-aware guidance is a simple, effective remedy for short-horizon myopia in diffusion-model MBRL.
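In symbols, the abstract's reweighting claim amounts to sampling trajectories from a tilted distribution (our notation, not the paper's; the weight w is whatever SAG/EAG defines):

```latex
\tilde{p}(\tau) = \frac{w(\tau)\, p_\theta(\tau)}{\mathbb{E}_{\tau' \sim p_\theta}\!\left[\, w(\tau') \,\right]},
\qquad
\nabla_{\tau} \log \tilde{p}(\tau) = \nabla_{\tau} \log p_\theta(\tau) + \nabla_{\tau} \log w(\tau).
```

Here p_theta is the unguided diffusion model and w is increasing in the advantages along the trajectory; the second identity is why the weight can be applied at inference time as an additive score term, without retraining the model.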
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Advantage-Guided Diffusion for Model-Based Reinforcement Learning (AGD-MBRL), which steers the reverse process of a diffusion world model using the agent's advantage estimates via two new guides, Sigmoid Advantage Guidance (SAG) and Exponential Advantage Guidance (EAG). It claims to prove that this enables reweighted trajectory sampling with weights increasing in state-action advantage A(s,a), implying policy improvement under standard assumptions, and that the generated trajectories follow a strictly improved policy relative to an unguided diffusion model. The method integrates with PolyGRAD-style architectures by guiding only state components while leaving actions policy-conditioned, requires no change to the diffusion training objective, and reports improved sample efficiency and returns on MuJoCo tasks (HalfCheetah, Hopper, Walker2D, Reacher) over PolyGRAD, reward-guided baselines, and model-free methods like PPO/TRPO.
Significance. If the central theoretical claims hold, the work would offer a principled mechanism to inject long-horizon value information into diffusion-based world models, directly addressing short-horizon myopia without altering the training loss or architecture. The reported empirical gains (up to 2x in some cases) on standard continuous-control benchmarks would indicate practical utility for MBRL. Strengths include the seamless PolyGRAD compatibility and the focus on advantage rather than raw rewards; however, the dependence on self-generated advantage estimates introduces a potential circularity that must be resolved for the improvement guarantee to be robust.
major comments (3)
- [Abstract and theoretical proofs section] Abstract and the section containing the policy-improvement proofs: The claim that SAG/EAG guidance performs reweighted sampling of trajectories with weights increasing in state-action advantage A(s,a) (thereby implying policy improvement) is load-bearing. Yet the architecture applies guidance exclusively to state components while action generation remains fully conditioned on the unguided current policy. Because A(s,a) is a joint function of state and action, reweighting the marginal state trajectory does not automatically reweight the joint (s,a) measure; the resulting distribution therefore need not correspond to sampling from a policy with strictly higher value, even under standard MDP assumptions. A formal reduction showing how the conditional action sampling preserves the reweighting guarantee is required.
- [Theoretical proofs section] The section stating the proofs and assumptions: The proofs are asserted to hold 'under standard assumptions' (accurate advantage estimates, well-trained diffusion model, standard MDP properties), but the manuscript does not explicitly list or verify these assumptions, nor does it provide the full derivations or error analysis. Without this, the support for the central policy-improvement claim cannot be verified, especially given the state-only guidance architecture.
- [Experimental results section] Experimental results section (MuJoCo evaluations): Performance improvements over PolyGRAD and model-free baselines are reported without visible error bars, number of random seeds, or statistical significance tests. This makes it impossible to assess whether the claimed gains (including 2x margins) are reliable or could be explained by variance in the advantage estimates or diffusion sampling.
minor comments (2)
- [Method section] The definitions of SAG and EAG (sigmoid and exponential forms) should be presented with explicit mathematical formulas in the main text rather than deferred to appendices, to improve readability of the guidance mechanism (a plausible form is sketched after this list).
- [Notation and method] Notation for the guided reverse process and the reweighting weights could be unified across the theoretical and experimental sections to avoid ambiguity when comparing guided vs. unguided trajectories.
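As flagged in the first minor comment, the manuscript's formulas are not visible from the abstract alone; purely for orientation, weights matching the two names would canonically be (our guess, with temperature beta and advantage estimate A-hat, not the authors' definitions):

```latex
w_{\mathrm{SAG}}(s,a) = \sigma\!\big(\beta\,\hat{A}(s,a)\big) = \frac{1}{1 + e^{-\beta \hat{A}(s,a)}},
\qquad
w_{\mathrm{EAG}}(s,a) = \exp\!\big(\hat{A}(s,a)/\beta\big).
```

Both are increasing in the estimated advantage, as the improvement claim requires; the exponential form matches the tilt used in advantage-weighted regression, while the bounded sigmoid would temper the influence of outlier advantage estimates.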
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, providing clarifications on the theoretical claims and indicating the revisions incorporated into the manuscript.
Point-by-point responses
Referee: [Abstract and theoretical proofs section] Abstract and the section containing the policy-improvement proofs: The claim that SAG/EAG guidance performs reweighted sampling of trajectories with weights increasing in state-action advantage A(s,a) (thereby implying policy improvement) is load-bearing. Yet the architecture applies guidance exclusively to state components while action generation remains fully conditioned on the unguided current policy. Because A(s,a) is a joint function of state and action, reweighting the marginal state trajectory does not automatically reweight the joint (s,a) measure; the resulting distribution therefore need not correspond to sampling from a policy with strictly higher value, even under standard MDP assumptions. A formal reduction showing how the conditional action sampling preserves the reweighting guarantee is required.
Authors: We appreciate the referee's precise identification of the joint versus marginal distinction. Although guidance is applied only to states, actions are sampled conditionally from the fixed policy given those states. This structure induces an effective reweighting on the joint (s,a) measure because the guided state marginal is multiplied by the policy's conditional action probabilities. We have added a formal lemma (Lemma 3.2 in the revised theoretical section) that derives the joint reweighting factor explicitly and shows it remains monotonic in A(s,a) under the policy, thereby preserving the policy-improvement guarantee. The proof is included in the main text with a short sketch and full derivation moved to the appendix. revision: yes
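As described, the claimed lemma would factorize roughly as follows (our reconstruction of the rebuttal's argument, not the authors' statement), over a window of H steps with a guided state marginal and policy-conditioned actions:

```latex
\tilde{p}(s_{1:H}, a_{1:H})
= \frac{w(s_{1:H})\, p_\theta(s_{1:H})}{Z} \,\prod_{h=1}^{H} \pi(a_h \mid s_h),
\qquad
Z = \mathbb{E}_{p_\theta}\!\big[\, w(s_{1:H}) \,\big].
```

The joint is thus reweighted relative to the unguided joint by w(s_{1:H})/Z, which depends on actions only through the states they induce; the substance of the claimed Lemma 3.2 must be that this induced dependence is monotone in A(s,a), which is exactly the step the referee asks to see.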
Referee: [Theoretical proofs section] The section stating the proofs and assumptions: The proofs are asserted to hold 'under standard assumptions' (accurate advantage estimates, well-trained diffusion model, standard MDP properties), but the manuscript does not explicitly list or verify these assumptions, nor does it provide the full derivations or error analysis. Without this, the support for the central policy-improvement claim cannot be verified, especially given the state-only guidance architecture.
Authors: We agree that explicit enumeration of assumptions improves verifiability. The revised manuscript now contains a dedicated 'Assumptions' subsection (Section 3.1) that lists all required conditions, including bounded advantage estimation error, sufficient diffusion model capacity, and standard MDP properties (finite horizon, bounded rewards). Full derivations of the reweighting and policy-improvement results have been moved to Appendix B, and we have added a brief error-propagation analysis showing that small advantage estimation errors lead to correspondingly bounded degradation in the improvement guarantee. revision: yes
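For concreteness, a standard bound of the kind the response gestures at (illustrative only, under infinite-horizon discounting with factor gamma; the paper's finite-horizon constants would differ): if the estimate satisfies |A-hat - A^pi| <= epsilon everywhere and the guided policy pi-tilde achieves at least the expected estimated advantage of pi in every state, the performance difference lemma gives

```latex
V^{\tilde{\pi}}(s_0) - V^{\pi}(s_0)
= \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^{\tilde{\pi}},\, a \sim \tilde{\pi}}\!\big[A^{\pi}(s,a)\big]
\;\ge\; -\,\frac{2\varepsilon}{1-\gamma},
```

since the true expected advantage under pi-tilde can undershoot its estimate by at most epsilon, and the estimate under pi (whose true expected advantage is zero) by another epsilon.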
Referee: [Experimental results section] Experimental results section (MuJoCo evaluations): Performance improvements over PolyGRAD and model-free baselines are reported without visible error bars, number of random seeds, or statistical significance tests. This makes it impossible to assess whether the claimed gains (including 2x margins) are reliable or could be explained by variance in the advantage estimates or diffusion sampling.
Authors: The referee correctly notes the missing statistical details. We have revised all result figures and tables to display error bars corresponding to standard error across 5 independent random seeds. The experimental section now explicitly states the seed count and includes paired t-test p-values for all reported comparisons against PolyGRAD, reward-guided baselines, and model-free methods, confirming statistical significance of the observed improvements. revision: yes
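For reference, the kind of summary the revision describes, computed with SciPy (a generic sketch; the authors' actual evaluation script is not available here):

```python
import numpy as np
from scipy import stats

def summarize_runs(agd_returns, baseline_returns):
    """Per-seed final returns for AGD-MBRL and a baseline, paired by seed
    (e.g. arrays of length 5 for 5 random seeds)."""
    agd = np.asarray(agd_returns, dtype=float)
    base = np.asarray(baseline_returns, dtype=float)
    return {
        "mean": agd.mean(),
        "sem": stats.sem(agd),                  # standard error across seeds
        "paired_t": stats.ttest_rel(agd, base), # paired t-test over seeds
    }
    # NB: with only 5 seeds a t-test has limited power; effect sizes and
    # per-seed curves are worth reporting alongside the p-values.
```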
Circularity Check
No significant circularity in the claimed proof of policy improvement
Full rationale
The paper presents a proof that SAG/EAG guidance enables reweighted sampling with weights increasing in state-action advantage, implying policy improvement under standard assumptions, plus a separate claim that generated trajectories follow a higher-value policy than unguided diffusion. These are theoretical statements resting on MDP properties and accurate advantage estimates; such premises are standard in RL and do not by construction reduce the result to a tautology, a fitted parameter, or a self-citation chain. The architectural note that only states are guided while actions stay policy-conditioned is presented as an integration detail that leaves the training objective untouched, and it introduces no self-definitional dependence visible in the abstract or the stated claims. The derivation chain stands on its own, and the empirical claims are checked against external baselines rather than the paper's own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: standard assumptions for policy improvement in RL (accurate advantage estimates, MDP properties)
Reference graph
Works this paper leans on
- [1] A. Mirhoseini, A. Goldie, M. Yazgan, J. W. Jiang, E. Songhori, S. Wang, Y.-J. Lee, E. Johnson, O. Pathak, A. Nazi, et al., "A graph placement methodology for fast chip design," Nature, vol. 594, no. 7862, pp. 207–212, 2021.
- [2] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al., "Grandmaster level in StarCraft II using multi-agent reinforcement learning," Nature, vol. 575, no. 7782, pp. 350–354, 2019.
- [3] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al., "Mastering chess and shogi by self-play with a general reinforcement learning algorithm," arXiv preprint arXiv:1712.01815, 2017.
- [4] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
- [5] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
- [6] T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel, "Model-ensemble trust-region policy optimization," arXiv preprint arXiv:1802.10592, 2018.
- [7] V. Micheli, E. Alonso, and F. Fleuret, "Transformers are sample-efficient world models," arXiv preprint arXiv:2209.00588, 2022.
- [8] J. Robine, M. Höftmann, T. Uelwer, and S. Harmeling, "Transformer-based world models are happy with 100k interactions," arXiv preprint arXiv:2303.07109, 2023.
- [9] I. Schubert, J. Zhang, J. Bruce, S. Bechtle, E. Parisotto, M. Riedmiller, J. T. Springenberg, A. Byravan, L. Hasenclever, and N. Heess, "A generalist dynamics model for control," arXiv preprint arXiv:2305.10912, 2023.
- [10] D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap, "Mastering diverse domains through world models," arXiv preprint arXiv:2301.04104, 2023.
- [11] M. Janner, Y. Du, J. B. Tenenbaum, and S. Levine, "Planning with diffusion for flexible behavior synthesis," arXiv preprint arXiv:2205.09991, 2022.
- [12] M. Rigter, J. Yamada, and I. Posner, "World models via policy-guided trajectory diffusion," arXiv preprint arXiv:2312.08533, 2023.
- [13] M. T. Jackson, M. T. Matthews, C. Lu, B. Ellis, S. Whiteson, and J. Foerster, "Policy-guided diffusion," arXiv preprint arXiv:2404.06356, 2024.
- [14] D. Foffano, A. Russo, and A. Proutiere, "Adversarial diffusion for robust reinforcement learning," arXiv preprint arXiv:2509.23846, 2025.
- [15] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, "Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning," in 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559–7566, IEEE, 2018.
- [16] K. Chua, R. Calandra, R. McAllister, and S. Levine, "Deep reinforcement learning in a handful of trials using probabilistic dynamics models," Advances in Neural Information Processing Systems, vol. 31, 2018.
- [17] L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, et al., "Model-based reinforcement learning for Atari," arXiv preprint arXiv:1903.00374, 2019.
- [18] T. Jafferjee, E. Imani, E. Talvitie, M. White, and M. Bowling, "Hallucinating value: A pitfall of dyna-style planning with imperfect environment models," arXiv preprint arXiv:2006.04363, 2020.
- [19] E. van der Pol, T. Kipf, F. A. Oliehoek, and M. Welling, "Plannable approximations to MDP homomorphisms: Equivariance under actions," arXiv preprint arXiv:2002.11963, 2020.
- [20] D. P. Kingma, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
- [21] A. Vaswani, "Attention is all you need," Advances in Neural Information Processing Systems, 2017.
- [22] D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba, "Mastering Atari with discrete world models," arXiv preprint arXiv:2010.02193, 2020.
- [23] D. Ha and J. Schmidhuber, "Recurrent world models facilitate policy evolution," Advances in Neural Information Processing Systems, vol. 31, 2018.
- [24] C. Xiao, Y. Wu, C. Ma, D. Schuurmans, and M. Müller, "Learning to combat compounding-error in model-based reinforcement learning," arXiv preprint arXiv:1912.11206, 2019.
- [25] K. Asadi, D. Misra, S. Kim, and M. L. Littman, "Combating the compounding-error problem with a multi-step model," arXiv preprint arXiv:1905.13320, 2019.
- [26] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, "Deep unsupervised learning using nonequilibrium thermodynamics," in International Conference on Machine Learning, pp. 2256–2265, PMLR, 2015.
- [27] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
- [28] P. Dhariwal and A. Nichol, "Diffusion models beat GANs on image synthesis," Advances in Neural Information Processing Systems, vol. 34, pp. 8780–8794, 2021.
- [29] J. Ho and T. Salimans, "Classifier-free diffusion guidance," arXiv preprint arXiv:2207.12598, 2022.
- [30] A. Ajay, Y. Du, A. Gupta, J. Tenenbaum, T. Jaakkola, and P. Agrawal, "Is conditional generative modeling all you need for decision-making?," arXiv preprint arXiv:2211.15657, 2022.
- [31] Z. Zhu, M. Liu, L. Mao, B. Kang, M. Xu, Y. Yu, S. Ermon, and W. Zhang, "MADiff: Offline multi-agent learning with diffusion models," arXiv preprint arXiv:2305.17330, 2023.
- [32] C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song, "Diffusion policy: Visuomotor policy learning via action diffusion," The International Journal of Robotics Research, 2023.
- [33] X. Li, V. Belagali, J. Shang, and M. S. Ryoo, "Crossway diffusion: Improving diffusion-based visuomotor policy via self-supervised learning," in 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 16841–16849, IEEE, 2024.
- [34] Z. Liang, Y. Mu, M. Ding, F. Ni, M. Tomizuka, and P. Luo, "AdaptDiffuser: Diffusion models as adaptive self-evolving planners," arXiv preprint arXiv:2302.01877, 2023.
- [35] T. Wang, X. Bao, I. Clavera, J. Hoang, Y. Wen, E. Langlois, S. Zhang, G. Zhang, P. Abbeel, and J. Ba, "Benchmarking model-based reinforcement learning," arXiv preprint arXiv:1907.02057, 2019.
- [36] P. Hansen-Estruch, I. Kostrikov, M. Janner, J. G. Kuba, and S. Levine, "IDQL: Implicit Q-learning as an actor-critic method with diffusion policies," arXiv preprint arXiv:2304.10573, 2023.
- [37] B. Mazoure, W. Talbott, M. A. Bautista, D. Hjelm, A. Toshev, and J. Susskind, "Value function estimation using conditional diffusion models for control," arXiv preprint arXiv:2306.07290, 2023.
- [38] R. S. Sutton, "Dyna, an integrated architecture for learning, planning, and reacting," ACM SIGART Bulletin, vol. 2, no. 4, pp. 160–163, 1991.
- [39] J. Schulman, "Trust region policy optimization," arXiv preprint arXiv:1502.05477, 2015.
- [40] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
- [41] Y. Zheng, J. Li, D. Yu, Y. Yang, S. E. Li, X. Zhan, and J. Liu, "Safe offline reinforcement learning with feasibility-guided diffusion model," arXiv preprint arXiv:2401.10700, 2024.
- [42] D. Shribak, C.-X. Gao, Y. Li, C. Xiao, and B. Dai, "Diffusion spectral representation for reinforcement learning," Advances in Neural Information Processing Systems, vol. 37, pp. 110028–110056, 2024.
- [43] D. Ki, J. Oh, S.-W. Shim, and B.-J. Lee, "Prior-guided diffusion planning for offline reinforcement learning," arXiv preprint arXiv:2505.10881, 2025.
- [44] H. Ma, T. Chen, K. Wang, N. Li, and B. Dai, "Efficient online reinforcement learning for diffusion policy," arXiv preprint arXiv:2502.00361, 2025.
- [45] S. Levine, "Reinforcement learning and control as probabilistic inference: Tutorial and review," arXiv preprint arXiv:1805.00909, 2018.
- [46] R. S. Sutton, A. G. Barto, et al., Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.
- [47] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, pp. 1928–1937, PMLR, 2016.
- [48] A. Lugmayr, M. Danelljan, A. Romero, F. Yu, R. Timofte, and L. Van Gool, "RePaint: Inpainting using denoising diffusion probabilistic models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11461–11471, 2022.
- [49] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, "Stable-Baselines3: Reliable reinforcement learning implementations," Journal of Machine Learning Research, vol. 22, no. 268, pp. 1–8, 2021.
- [50] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," arXiv preprint arXiv:2112.10752, 2021.
- [51] Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le, "Flow matching for generative modeling," arXiv preprint arXiv:2210.02747, 2022.