Advantage-Guided Diffusion for Model-Based Reinforcement Learning
Pith reviewed 2026-05-10 17:16 UTC · model grok-4.3
The pith
Advantage estimates guide diffusion models to sample higher-value trajectories and improve policies in model-based RL.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce Advantage-Guided Diffusion for MBRL (AGD-MBRL) and develop Sigmoid Advantage Guidance (SAG) and Exponential Advantage Guidance (EAG). We prove that guiding a diffusion model through SAG or EAG permits reweighted sampling of trajectories with weights that increase in state-action advantage, implying policy improvement under standard assumptions. We further show that trajectories generated from AGD-MBRL follow an improved policy with higher value than those from an unguided diffusion model. AGD integrates with PolyGRAD-style models by guiding only state components while leaving actions policy-conditioned and requires no change to the diffusion training objective.
What carries the argument
Advantage-Guided Diffusion (AGD) via SAG or EAG, which steers the reverse diffusion sampling toward trajectories with higher state-action advantages while keeping action generation policy-conditioned.
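A minimal sketch of what such guidance could look like in code, assuming classifier-guidance-style steering of a Gaussian reverse step as in Diffuser/PolyGRAD-type models; `advantage_grad`, `guidance_scale`, and the policy interface are illustrative placeholders, not the paper's API:

```python
import numpy as np

def guided_reverse_step(mu, sigma, advantage_grad, policy, guidance_scale=1.0):
    """One reverse-diffusion step with advantage guidance on states only.

    mu, sigma:       mean/std of the unguided denoising step for the states.
    advantage_grad:  gradient of the log advantage-weight w.r.t. the states
                     (e.g. from SAG or EAG; a hypothetical callable here).
    policy:          maps states to sampled actions; actions stay unguided.
    """
    # Shift the denoising mean along the advantage gradient, scaled by the
    # step variance -- the standard classifier-guidance update, applied to
    # the state components only.
    mu_guided = mu + guidance_scale * (sigma ** 2) * advantage_grad(mu)
    next_states = mu_guided + sigma * np.random.randn(*np.shape(mu))
    # Actions are drawn policy-conditioned on the (partially denoised) states.
    actions = policy(next_states)
    return next_states, actions
```

The design point this sketch illustrates is that the guidance term touches only the state trajectory; the policy's conditional action distribution is left untouched, which is also the source of the referee's joint-versus-marginal objection below.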
If this is right
- Trajectories sampled under SAG or EAG follow policies with strictly higher value than unguided diffusion trajectories.
- The reweighted sampling concentrates probability mass on actions whose advantages are positive, directly supporting policy improvement.
- AGD-MBRL achieves higher sample efficiency and final returns than PolyGRAD, reward-guided diffusion, and model-free methods on MuJoCo tasks.
- No modification to the diffusion training objective is needed, so existing models can adopt the guidance at inference time.
Where Pith is reading between the lines
- The same advantage-steering idea could be tested on other generative world models such as flow-matching or autoregressive transformers to check whether the improvement guarantee generalizes.
- In sparse-reward or long-horizon settings the method might allow shorter diffusion windows without loss of planning quality.
- If advantage estimates contain systematic bias the reweighting may concentrate on locally attractive but globally suboptimal trajectories.
Load-bearing premise
Advantage estimates computed from the current policy are accurate and the diffusion model is trained well enough for the guidance to shift sampling toward genuinely higher-value trajectories.
What would settle it
An experiment showing that AGD-generated trajectories produce no increase in average return or policy value compared to unguided diffusion when advantage estimates are held fixed and accurate.
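One way to operationalize that test, sketched with a caller-supplied `rollout` closure so nothing depends on the paper's unavailable code; the protocol is our reading of the falsifier, not an experiment from the manuscript:

```python
import numpy as np
from scipy import stats

def compare_guidance(rollout, n_rollouts=100):
    """rollout(guided: bool) -> float returns the total return of one
    model-generated trajectory; the advantage estimator is frozen (not
    updated) for the duration of the comparison."""
    unguided = np.array([rollout(guided=False) for _ in range(n_rollouts)])
    guided = np.array([rollout(guided=True) for _ in range(n_rollouts)])
    # Welch's t-test: with accurate, fixed advantages, a null result here
    # would falsify the claimed improvement mechanism.
    t_stat, p_value = stats.ttest_ind(guided, unguided, equal_var=False)
    return guided.mean() - unguided.mean(), p_value
```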
Original abstract
Model-based reinforcement learning (MBRL) with autoregressive world models suffers from compounding errors, whereas diffusion world models mitigate this by generating trajectory segments jointly. However, existing diffusion guides are either policy-only, discarding value information, or reward-based, which becomes myopic when the diffusion horizon is short. We introduce Advantage-Guided Diffusion for MBRL (AGD-MBRL), which steers the reverse diffusion process using the agent's advantage estimates so that sampling concentrates on trajectories expected to yield higher long-term return beyond the generated window. We develop two guides: (i) Sigmoid Advantage Guidance (SAG) and (ii) Exponential Advantage Guidance (EAG). We prove that a diffusion model guided through SAG or EAG allows us to perform reweighted sampling of trajectories with weights increasing in state-action advantage, implying policy improvement under standard assumptions. Additionally, we show that the trajectories generated from AGD-MBRL follow an improved policy (that is, with higher value) compared to an unguided diffusion model. AGD integrates seamlessly with PolyGRAD-style architectures by guiding the state components while leaving action generation policy-conditioned, and requires no change to the diffusion training objective. On MuJoCo control tasks (HalfCheetah, Hopper, Walker2D and Reacher), AGD-MBRL improves sample efficiency and final return over PolyGRAD, an online Diffuser-style reward guide, and model-free baselines (PPO/TRPO), in some cases by a margin of 2x. These results show that advantage-aware guidance is a simple, effective remedy for short-horizon myopia in diffusion-model MBRL.
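In symbols, the abstract's reweighting claim amounts to sampling trajectories from a tilted distribution (our notation, not the paper's; the weight w is whatever SAG/EAG defines):

```latex
\tilde{p}(\tau) = \frac{w(\tau)\, p_\theta(\tau)}{\mathbb{E}_{\tau' \sim p_\theta}\!\left[\, w(\tau') \,\right]},
\qquad
\nabla_{\tau} \log \tilde{p}(\tau) = \nabla_{\tau} \log p_\theta(\tau) + \nabla_{\tau} \log w(\tau).
```

Here p_theta is the unguided diffusion model and w is increasing in the advantages along the trajectory; the second identity is why the weight can be applied at inference time as an additive score term, without retraining the model.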
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Advantage-Guided Diffusion for Model-Based Reinforcement Learning (AGD-MBRL), which steers the reverse process of a diffusion world model using the agent's advantage estimates via two new guides, Sigmoid Advantage Guidance (SAG) and Exponential Advantage Guidance (EAG). It claims to prove that this enables reweighted trajectory sampling with weights increasing in state-action advantage A(s,a), implying policy improvement under standard assumptions, and that the generated trajectories follow a strictly improved policy relative to an unguided diffusion model. The method integrates with PolyGRAD-style architectures by guiding only state components while leaving actions policy-conditioned, requires no change to the diffusion training objective, and reports improved sample efficiency and returns on MuJoCo tasks (HalfCheetah, Hopper, Walker2D, Reacher) over PolyGRAD, reward-guided baselines, and model-free methods like PPO/TRPO.
Significance. If the central theoretical claims hold, the work would offer a principled mechanism to inject long-horizon value information into diffusion-based world models, directly addressing short-horizon myopia without altering the training loss or architecture. The reported empirical gains (up to 2x in some cases) on standard continuous-control benchmarks would indicate practical utility for MBRL. Strengths include the seamless PolyGRAD compatibility and the focus on advantage rather than raw rewards; however, the dependence on self-generated advantage estimates introduces a potential circularity that must be resolved for the improvement guarantee to be robust.
major comments (3)
- [Abstract and theoretical proofs section] Abstract and the section containing the policy-improvement proofs: The claim that SAG/EAG guidance performs reweighted sampling of trajectories with weights increasing in state-action advantage A(s,a) (thereby implying policy improvement) is load-bearing. Yet the architecture applies guidance exclusively to state components while action generation remains fully conditioned on the unguided current policy. Because A(s,a) is a joint function of state and action, reweighting the marginal state trajectory does not automatically reweight the joint (s,a) measure; the resulting distribution therefore need not correspond to sampling from a policy with strictly higher value, even under standard MDP assumptions. A formal reduction showing how the conditional action sampling preserves the reweighting guarantee is required.
- [Theoretical proofs section] The section stating the proofs and assumptions: The proofs are asserted to hold 'under standard assumptions' (accurate advantage estimates, well-trained diffusion model, standard MDP properties), but the manuscript does not explicitly list or verify these assumptions, nor does it provide the full derivations or error analysis. Without this, the support for the central policy-improvement claim cannot be verified, especially given the state-only guidance architecture.
- [Experimental results section] Experimental results section (MuJoCo evaluations): Performance improvements over PolyGRAD and model-free baselines are reported without visible error bars, number of random seeds, or statistical significance tests. This makes it impossible to assess whether the claimed gains (including 2x margins) are reliable or could be explained by variance in the advantage estimates or diffusion sampling.
minor comments (2)
- [Method section] The definitions of SAG and EAG (sigmoid and exponential forms) should be presented with explicit mathematical formulas in the main text rather than deferred to appendices, to improve readability of the guidance mechanism (a plausible form is sketched after this list).
- [Notation and method] Notation for the guided reverse process and the reweighting weights could be unified across the theoretical and experimental sections to avoid ambiguity when comparing guided vs. unguided trajectories.
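As flagged in the first minor comment, the manuscript's formulas are not visible from the abstract alone; purely for orientation, weights matching the two names would canonically be (our guess, with temperature beta and advantage estimate A-hat, not the authors' definitions):

```latex
w_{\mathrm{SAG}}(s,a) = \sigma\!\big(\beta\,\hat{A}(s,a)\big) = \frac{1}{1 + e^{-\beta \hat{A}(s,a)}},
\qquad
w_{\mathrm{EAG}}(s,a) = \exp\!\big(\hat{A}(s,a)/\beta\big).
```

Both are increasing in the estimated advantage, as the improvement claim requires; the exponential form matches the tilt used in advantage-weighted regression, while the bounded sigmoid would temper the influence of outlier advantage estimates.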
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, providing clarifications on the theoretical claims and indicating the revisions incorporated into the manuscript.
Point-by-point responses
Referee: [Abstract and theoretical proofs section] Abstract and the section containing the policy-improvement proofs: The claim that SAG/EAG guidance performs reweighted sampling of trajectories with weights increasing in state-action advantage A(s,a) (thereby implying policy improvement) is load-bearing. Yet the architecture applies guidance exclusively to state components while action generation remains fully conditioned on the unguided current policy. Because A(s,a) is a joint function of state and action, reweighting the marginal state trajectory does not automatically reweight the joint (s,a) measure; the resulting distribution therefore need not correspond to sampling from a policy with strictly higher value, even under standard MDP assumptions. A formal reduction showing how the conditional action sampling preserves the reweighting guarantee is required.
Authors: We appreciate the referee's precise identification of the joint versus marginal distinction. Although guidance is applied only to states, actions are sampled conditionally from the fixed policy given those states. This structure induces an effective reweighting on the joint (s,a) measure because the guided state marginal is multiplied by the policy's conditional action probabilities. We have added a formal lemma (Lemma 3.2 in the revised theoretical section) that derives the joint reweighting factor explicitly and shows it remains monotonic in A(s,a) under the policy, thereby preserving the policy-improvement guarantee. The proof is included in the main text with a short sketch and full derivation moved to the appendix. revision: yes
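As described, the claimed lemma would factorize roughly as follows (our reconstruction of the rebuttal's argument, not the authors' statement), over a window of H steps with a guided state marginal and policy-conditioned actions:

```latex
\tilde{p}(s_{1:H}, a_{1:H})
= \frac{w(s_{1:H})\, p_\theta(s_{1:H})}{Z} \,\prod_{h=1}^{H} \pi(a_h \mid s_h),
\qquad
Z = \mathbb{E}_{p_\theta}\!\big[\, w(s_{1:H}) \,\big].
```

The joint is thus reweighted relative to the unguided joint by w(s_{1:H})/Z, which depends on actions only through the states they induce; the substance of the claimed Lemma 3.2 must be that this induced dependence is monotone in A(s,a), which is exactly the step the referee asks to see.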
Referee: [Theoretical proofs section] The section stating the proofs and assumptions: The proofs are asserted to hold 'under standard assumptions' (accurate advantage estimates, well-trained diffusion model, standard MDP properties), but the manuscript does not explicitly list or verify these assumptions, nor does it provide the full derivations or error analysis. Without this, the support for the central policy-improvement claim cannot be verified, especially given the state-only guidance architecture.
Authors: We agree that explicit enumeration of assumptions improves verifiability. The revised manuscript now contains a dedicated 'Assumptions' subsection (Section 3.1) that lists all required conditions, including bounded advantage estimation error, sufficient diffusion model capacity, and standard MDP properties (finite horizon, bounded rewards). Full derivations of the reweighting and policy-improvement results have been moved to Appendix B, and we have added a brief error-propagation analysis showing that small advantage estimation errors lead to correspondingly bounded degradation in the improvement guarantee. revision: yes
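For concreteness, a standard bound of the kind the response gestures at (illustrative only, under infinite-horizon discounting with factor gamma; the paper's finite-horizon constants would differ): if the estimate satisfies |A-hat - A^pi| <= epsilon everywhere and the guided policy pi-tilde achieves at least the expected estimated advantage of pi in every state, the performance difference lemma gives

```latex
V^{\tilde{\pi}}(s_0) - V^{\pi}(s_0)
= \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^{\tilde{\pi}},\, a \sim \tilde{\pi}}\!\big[A^{\pi}(s,a)\big]
\;\ge\; -\,\frac{2\varepsilon}{1-\gamma},
```

since the true expected advantage under pi-tilde can undershoot its estimate by at most epsilon, and the estimate under pi (whose true expected advantage is zero) by another epsilon.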
Referee: [Experimental results section] Experimental results section (MuJoCo evaluations): Performance improvements over PolyGRAD and model-free baselines are reported without visible error bars, number of random seeds, or statistical significance tests. This makes it impossible to assess whether the claimed gains (including 2x margins) are reliable or could be explained by variance in the advantage estimates or diffusion sampling.
Authors: The referee correctly notes the missing statistical details. We have revised all result figures and tables to display error bars corresponding to standard error across 5 independent random seeds. The experimental section now explicitly states the seed count and includes paired t-test p-values for all reported comparisons against PolyGRAD, reward-guided baselines, and model-free methods, confirming statistical significance of the observed improvements. revision: yes
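For reference, the kind of summary the revision describes, computed with SciPy (a generic sketch; the authors' actual evaluation script is not available here):

```python
import numpy as np
from scipy import stats

def summarize_runs(agd_returns, baseline_returns):
    """Per-seed final returns for AGD-MBRL and a baseline, paired by seed
    (e.g. arrays of length 5 for 5 random seeds)."""
    agd = np.asarray(agd_returns, dtype=float)
    base = np.asarray(baseline_returns, dtype=float)
    return {
        "mean": agd.mean(),
        "sem": stats.sem(agd),                  # standard error across seeds
        "paired_t": stats.ttest_rel(agd, base), # paired t-test over seeds
    }
    # NB: with only 5 seeds a t-test has limited power; effect sizes and
    # per-seed curves are worth reporting alongside the p-values.
```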
Circularity Check
No significant circularity in the claimed proof of policy improvement
Full rationale
The paper presents a proof that SAG/EAG guidance enables reweighted sampling with weights increasing in state-action advantage, implying policy improvement under standard assumptions, plus a separate claim that generated trajectories follow a higher-value policy than unguided diffusion. These are theoretical statements resting on MDP properties and accurate advantage estimates; such premises are standard in RL and do not by construction reduce the result to a tautology, a fitted parameter, or a self-citation chain. The architectural note that only states are guided while actions stay policy-conditioned is presented as an integration detail that leaves the training objective untouched, and it introduces no self-definitional dependence visible in the abstract or the stated claims. The derivation chain stands on its own, and the empirical claims are checked against external baselines rather than the paper's own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: standard assumptions for policy improvement in RL (accurate advantage estimates, MDP properties)
Reference graph
Works this paper leans on
- [1] A. Mirhoseini, A. Goldie, M. Yazgan, J. W. Jiang, E. Songhori, S. Wang, Y.-J. Lee, E. Johnson, O. Pathak, A. Nazi, et al., "A graph placement methodology for fast chip design," Nature, vol. 594, no. 7862, pp. 207–212, 2021.
- [2] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al., "Grandmaster level in StarCraft II using multi-agent reinforcement learning," Nature, vol. 575, no. 7782, pp. 350–354, 2019.
- [3] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al., "Mastering chess and shogi by self-play with a general reinforcement learning algorithm," arXiv preprint arXiv:1712.01815, 2017.
- [4] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
- [5] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
- [6] T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel, "Model-ensemble trust-region policy optimization," arXiv preprint arXiv:1802.10592, 2018.
- [7] V. Micheli, E. Alonso, and F. Fleuret, "Transformers are sample-efficient world models," arXiv preprint arXiv:2209.00588, 2022.
- [8] J. Robine, M. Höftmann, T. Uelwer, and S. Harmeling, "Transformer-based world models are happy with 100k interactions," arXiv preprint arXiv:2303.07109, 2023.
- [9] I. Schubert, J. Zhang, J. Bruce, S. Bechtle, E. Parisotto, M. Riedmiller, J. T. Springenberg, A. Byravan, L. Hasenclever, and N. Heess, "A generalist dynamics model for control," arXiv preprint arXiv:2305.10912, 2023.
- [10] D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap, "Mastering diverse domains through world models," arXiv preprint arXiv:2301.04104, 2023.
- [11] M. Janner, Y. Du, J. B. Tenenbaum, and S. Levine, "Planning with diffusion for flexible behavior synthesis," arXiv preprint arXiv:2205.09991, 2022.
- [12] M. Rigter, J. Yamada, and I. Posner, "World models via policy-guided trajectory diffusion," arXiv preprint arXiv:2312.08533, 2023.
- [13] M. T. Jackson, M. T. Matthews, C. Lu, B. Ellis, S. Whiteson, and J. Foerster, "Policy-guided diffusion," arXiv preprint arXiv:2404.06356, 2024.
- [14] D. Foffano, A. Russo, and A. Proutiere, "Adversarial diffusion for robust reinforcement learning," arXiv preprint arXiv:2509.23846, 2025.
- [15] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, "Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning," in 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559–7566, IEEE, 2018.
- [16] K. Chua, R. Calandra, R. McAllister, and S. Levine, "Deep reinforcement learning in a handful of trials using probabilistic dynamics models," Advances in Neural Information Processing Systems, vol. 31, 2018.
- [17] L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, et al., "Model-based reinforcement learning for Atari," arXiv preprint arXiv:1903.00374, 2019.
- [18] T. Jafferjee, E. Imani, E. Talvitie, M. White, and M. Bowling, "Hallucinating value: A pitfall of dyna-style planning with imperfect environment models," arXiv preprint arXiv:2006.04363, 2020.
- [19] E. van der Pol, T. Kipf, F. A. Oliehoek, and M. Welling, "Plannable approximations to MDP homomorphisms: Equivariance under actions," arXiv preprint arXiv:2002.11963, 2020.
- [20] D. P. Kingma, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
- [21] A. Vaswani, "Attention is all you need," Advances in Neural Information Processing Systems, 2017.
- [22] D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba, "Mastering Atari with discrete world models," arXiv preprint arXiv:2010.02193, 2020.
- [23] D. Ha and J. Schmidhuber, "Recurrent world models facilitate policy evolution," Advances in Neural Information Processing Systems, vol. 31, 2018.
- [24] C. Xiao, Y. Wu, C. Ma, D. Schuurmans, and M. Müller, "Learning to combat compounding-error in model-based reinforcement learning," arXiv preprint arXiv:1912.11206, 2019.
- [25] K. Asadi, D. Misra, S. Kim, and M. L. Littman, "Combating the compounding-error problem with a multi-step model," arXiv preprint arXiv:1905.13320, 2019.
- [26] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, "Deep unsupervised learning using nonequilibrium thermodynamics," in International Conference on Machine Learning, pp. 2256–2265, PMLR, 2015.
- [27] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
- [28] P. Dhariwal and A. Nichol, "Diffusion models beat GANs on image synthesis," Advances in Neural Information Processing Systems, vol. 34, pp. 8780–8794, 2021.
- [29] J. Ho and T. Salimans, "Classifier-free diffusion guidance," arXiv preprint arXiv:2207.12598, 2022.
- [30] A. Ajay, Y. Du, A. Gupta, J. Tenenbaum, T. Jaakkola, and P. Agrawal, "Is conditional generative modeling all you need for decision-making?," arXiv preprint arXiv:2211.15657, 2022.
- [31] Z. Zhu, M. Liu, L. Mao, B. Kang, M. Xu, Y. Yu, S. Ermon, and W. Zhang, "MADiff: Offline multi-agent learning with diffusion models," arXiv preprint arXiv:2305.17330, 2023.
- [32] C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song, "Diffusion policy: Visuomotor policy learning via action diffusion," The International Journal of Robotics Research, 2023.
- [33] X. Li, V. Belagali, J. Shang, and M. S. Ryoo, "Crossway diffusion: Improving diffusion-based visuomotor policy via self-supervised learning," in 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 16841–16849, IEEE, 2024.
- [34] Z. Liang, Y. Mu, M. Ding, F. Ni, M. Tomizuka, and P. Luo, "AdaptDiffuser: Diffusion models as adaptive self-evolving planners," arXiv preprint arXiv:2302.01877, 2023.
- [35] T. Wang, X. Bao, I. Clavera, J. Hoang, Y. Wen, E. Langlois, S. Zhang, G. Zhang, P. Abbeel, and J. Ba, "Benchmarking model-based reinforcement learning," arXiv preprint arXiv:1907.02057, 2019.
- [36] P. Hansen-Estruch, I. Kostrikov, M. Janner, J. G. Kuba, and S. Levine, "IDQL: Implicit Q-learning as an actor-critic method with diffusion policies," arXiv preprint arXiv:2304.10573, 2023.
- [37] B. Mazoure, W. Talbott, M. A. Bautista, D. Hjelm, A. Toshev, and J. Susskind, "Value function estimation using conditional diffusion models for control," arXiv preprint arXiv:2306.07290, 2023.
- [38] R. S. Sutton, "Dyna, an integrated architecture for learning, planning, and reacting," ACM SIGART Bulletin, vol. 2, no. 4, pp. 160–163, 1991.
- [39] J. Schulman, "Trust region policy optimization," arXiv preprint arXiv:1502.05477, 2015.
- [40] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
- [41] Y. Zheng, J. Li, D. Yu, Y. Yang, S. E. Li, X. Zhan, and J. Liu, "Safe offline reinforcement learning with feasibility-guided diffusion model," arXiv preprint arXiv:2401.10700, 2024.
- [42] D. Shribak, C.-X. Gao, Y. Li, C. Xiao, and B. Dai, "Diffusion spectral representation for reinforcement learning," Advances in Neural Information Processing Systems, vol. 37, pp. 110028–110056, 2024.
- [43] D. Ki, J. Oh, S.-W. Shim, and B.-J. Lee, "Prior-guided diffusion planning for offline reinforcement learning," arXiv preprint arXiv:2505.10881, 2025.
- [44] H. Ma, T. Chen, K. Wang, N. Li, and B. Dai, "Efficient online reinforcement learning for diffusion policy," arXiv preprint arXiv:2502.00361, 2025.
- [45] S. Levine, "Reinforcement learning and control as probabilistic inference: Tutorial and review," arXiv preprint arXiv:1805.00909, 2018.
- [46] R. S. Sutton, A. G. Barto, et al., Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.
- [47] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, pp. 1928–1937, PMLR, 2016.
- [48] A. Lugmayr, M. Danelljan, A. Romero, F. Yu, R. Timofte, and L. Van Gool, "RePaint: Inpainting using denoising diffusion probabilistic models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11461–11471, 2022.
- [49] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, "Stable-Baselines3: Reliable reinforcement learning implementations," Journal of Machine Learning Research, vol. 22, no. 268, pp. 1–8, 2021.
- [50] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," arXiv preprint arXiv:2112.10752, 2021.
- [51] Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le, "Flow matching for generative modeling," arXiv preprint arXiv:2210.02747, 2022.