Stochastic MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent

Da Li; Liang Bai; Tianyuan Yu; Yanming Guo; Yanwei Fu; Ye Shi; Yuehu Gong; Yulin Chen; Zeyuan Wang

arxiv: 2605.21282 · v2 · pith:NO7H3SGJnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-22 09:48 UTC · model grok-4.3

keywords: MeanFlow policies · off-policy reinforcement learning · mirror descent · entropy regularization · generative policies · continuous control · MuJoCo benchmarks · one-step sampling

The pith

Stochastic MeanFlow Policies map Gaussian noise to actions in one step for multimodal off-policy RL with tractable entropy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Stochastic MeanFlow Policies to solve a core tension in off-policy reinforcement learning: how to combine expressive multimodal action distributions with fast single-step sampling and stable updates. Gaussian policies allow quick entropy calculations but cannot represent multiple good actions well, while many generative approaches require multiple sampling steps or lose reliable entropy estimates. By reparameterizing actions from Gaussian noise through a MeanFlow transformation, the new class supplies an accurate entropy surrogate that fits inside an off-policy mirror descent framework. This produces a single objective balancing entropy-driven exploration against regularization toward the prior policy. Results on seven MuJoCo tasks show consistent gains over both Gaussian and alternative generative baselines while preserving inference speed.

Core claim

Stochastic MeanFlow Policies map Gaussian noise to actions through a MeanFlow transformation. This stochastic reparameterisation yields a tractable entropy surrogate and allows MeanFlow policies to be trained within off-policy mirror descent under a unified objective for exploratory yet stable improvement. Across seven MuJoCo benchmarks, SMFP improves over Gaussian and generative baselines while retaining single-step inference efficiency.

What carries the argument

MeanFlow transformation: a one-pass stochastic mapping from Gaussian noise to actions that supplies a usable entropy surrogate inside the mirror-descent update.
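A minimal sketch of what a one-pass noise-to-action map can look like, assuming the one-step sampling rule from the MeanFlow generative-modelling paper [16], a = z − u(z, r=0, t=1), conditioned on the state. The function `toy_mean_velocity` is a hand-written, hypothetical stand-in for a learned network; it is chosen only so that the resulting action distribution has two modes, and it is not the paper's architecture.

```python
import math
import random


def toy_mean_velocity(state: float, z: float) -> float:
    """Hypothetical stand-in for a learned average-velocity network u(z, r=0, t=1 | s).

    Hand-written so that z - u lands near state - 1 or state + 1.
    """
    target = state + math.copysign(1.0, z) + 0.1 * z
    return z - target


def sample_action(state: float, rng: random.Random) -> float:
    """Draw one action with a single function evaluation (no iterative solver)."""
    z = rng.gauss(0.0, 1.0)                  # Gaussian noise
    return z - toy_mean_velocity(state, z)   # one-step MeanFlow-style update


rng = random.Random(0)
actions = [sample_action(0.0, rng) for _ in range(2000)]
left = sum(a < 0 for a in actions)
right = len(actions) - left
print(f"actions near -1: {left}, actions near +1: {right}")
```

Because sampling costs one call to the velocity function, inference cost matches a feed-forward Gaussian policy, while the output distribution is free to be multimodal.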

If this is right

A single objective can now enforce both exploration via entropy and stability via previous-policy regularization.
Policy classes no longer need to trade off multimodality against single-step sampling speed.
Off-policy mirror descent becomes directly compatible with generative policies that have tractable entropy.
Performance improvements appear across standard continuous-control benchmarks without extra sampling cost at deployment.
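One generic way to write such a single objective, offered as an illustrative sketch and not as the paper's exact loss (the passage quoted further down mentions a hinge-style entropy regulariser and an advantage-weighted regression objective, which may differ). The symbols α (entropy temperature) and λ (mirror-descent strength) are assumptions introduced here.

```latex
J(\pi) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[ Q(s,a) \big]
       + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s)\big)
       - \lambda\, \mathrm{KL}\big(\pi(\cdot \mid s) \,\|\, \pi_{\mathrm{old}}(\cdot \mid s)\big),
\qquad
\pi^{\star}(a \mid s) \;\propto\; \pi_{\mathrm{old}}(a \mid s)^{\frac{\lambda}{\alpha+\lambda}}
       \exp\!\Big( \frac{Q(s,a)}{\alpha+\lambda} \Big).
```

Setting λ = 0 recovers the SAC-style Boltzmann target exp(Q/α), and setting α = 0 recovers the mirror-descent target π_old · exp(Q/λ). With both terms active, the maximiser inherits every mode of Q that the previous policy still covers.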

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same one-step noise-to-action idea could be tested in settings where sampling speed matters more than in MuJoCo, such as real-time robotics or large-scale planning.
If the entropy surrogate remains reliable, similar transformations might reduce the need for separate entropy-coefficient tuning in other RL algorithms.
Extending the MeanFlow construction to discrete or hybrid action spaces would test whether the approach generalizes beyond continuous control.

Load-bearing premise

The MeanFlow policy class can match the multimodal target created by entropy regularization plus the mirror-descent constraint closely enough that the entropy surrogate stays accurate and does not bias the performance gains.
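One way to make "tractable entropy" concrete for a reparameterised policy is the textbook change-of-variables identity H(f(z)) = H(z) + E_z[log |f'(z)|], which holds when the map f is invertible and differentiable. The sketch below checks it in one dimension; it is an illustration of that identity under the invertibility assumption, not the paper's surrogate, whose explicit form is not given in this review. Both maps used are hypothetical.

```python
import math
import random

GAUSS_ENTROPY_1D = 0.5 * math.log(2.0 * math.pi * math.e)  # entropy of N(0, 1) in nats


def entropy_via_change_of_variables(log_abs_derivative, n_samples, rng):
    """Monte Carlo estimate of H(f(z)) = H(z) + E_z[log |f'(z)|] for invertible f."""
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)
        total += log_abs_derivative(z)
    return GAUSS_ENTROPY_1D + total / n_samples


rng = random.Random(0)

# Affine map a = mu + sigma * z: the estimate must match the Gaussian entropy formula.
sigma = 0.3
affine = entropy_via_change_of_variables(lambda z: math.log(sigma), 1000, rng)
exact = 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)
print(abs(affine - exact) < 1e-9)  # True

# Nonlinear monotone map a = z + 0.5 * tanh(z), with f'(z) = 1 + 0.5 * (1 - tanh(z)^2) > 0.
nonlinear = entropy_via_change_of_variables(
    lambda z: math.log(1.0 + 0.5 * (1.0 - math.tanh(z) ** 2)), 20000, rng
)
print(nonlinear > GAUSS_ENTROPY_1D)  # True: this map stretches space everywhere
```

If the learned map is not invertible, the identity no longer gives the true entropy, which is one concrete way the bias discussed in this premise could arise.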

What would settle it

Train SMFP on an environment whose optimal policy requires clearly separated action modes; if the learned policy collapses to a single mode or the reported gains over Gaussian policies disappear, the central claim does not hold.
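The check described above could be scripted roughly as follows: sample actions from the trained policy at a fixed state and measure how much probability mass falls near each expected mode. The helper names, the radius, and the `min_mass` threshold are hypothetical choices, and the sample lists are synthetic stand-ins, not outputs of SMFP.

```python
import random


def mode_fractions(actions, centers, radius):
    """Return the fraction of sampled actions within `radius` of each mode center."""
    counts = [0] * len(centers)
    for a in actions:
        for i, c in enumerate(centers):
            if abs(a - c) <= radius:
                counts[i] += 1
                break
    return [n / len(actions) for n in counts]


def has_collapsed(fractions, min_mass=0.1):
    """Flag collapse when any expected mode receives less than `min_mass` of the samples."""
    return any(f < min_mass for f in fractions)


rng = random.Random(0)
# Synthetic stand-ins for policy samples; not results from the paper.
bimodal = [rng.gauss(rng.choice([-1.0, 1.0]), 0.1) for _ in range(1000)]
unimodal = [rng.gauss(1.0, 0.1) for _ in range(1000)]

print(has_collapsed(mode_fractions(bimodal, [-1.0, 1.0], 0.5)))   # False
print(has_collapsed(mode_fractions(unimodal, [-1.0, 1.0], 0.5)))  # True
```

In a real experiment the mode centers would come from the environment's known optimal actions, and the check would be repeated over several states and seeds.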

Figures

Figures reproduced from arXiv: 2605.21282 by Da Li, Liang Bai, Tianyuan Yu, Yanming Guo, Yanwei Fu, Ye Shi, Yuehu Gong, Yulin Chen, Zeyuan Wang.

**Figure 1.** An illustration of SMFP Performance vs. Inference Speed. Evaluated on an NVIDIA RTX 5090 across Ant-v4, SMFP achieves strong performance with negligible inference latency.

**Figure 2.** Illustration of the composite SACMD target and induced multi-modal structure.

**Figure 3.** Comparison of policy parameterisations and action-distribution evolution on Push-T. (a) …

**Figure 4.** Comparison of evaluation performance across 7 benchmarks. All methods are trained for …

**Figure 6.** Sensitivity analysis of the entropy temperature parameter …

**Figure 8.** Considered environments. All these environments are from the mujoco gym bench…

read the original abstract

Online off-policy reinforcement learning (RL) is shaped by two coupled choices: the policy class and the update rule. Gaussian policies are fast and have tractable entropy, but struggle with multimodal action distributions. Generative policies are more expressive, but often require iterative sampling or lack tractable entropy estimates. On the optimisation side, SAC-style soft policy improvement and mirror descent (MD) can be viewed as minimising different KL divergences: the former moves the policy towards a value-induced Boltzmann distribution, while the latter regularises each update against the previous policy. Combining entropy regularisation with an MD constraint is therefore attractive, as it supports exploration while stabilising policy improvement; however, the resulting target can be multimodal and is poorly matched by unimodal Gaussian policies. We propose Stochastic MeanFlow Policies (SMFP), a one-step generative policy class that maps Gaussian noise to actions through a MeanFlow transformation. This stochastic reparameterisation yields a tractable entropy surrogate and allows MeanFlow policies to be trained within off-policy mirror descent under a unified objective for exploratory yet stable improvement. Across seven MuJoCo benchmarks, SMFP improves over Gaussian and generative baselines while retaining single-step inference efficiency.
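To see why the combined target can be multimodal, the sketch below evaluates a generic entropy-plus-KL target, π*(a) ∝ π_old(a)^(λ/(α+λ)) · exp(Q(a)/(α+λ)), on a discretised one-dimensional action grid. This closed form is the standard maximiser of E[Q] + αH(π) − λKL(π‖π_old) and is used here as an assumption; the toy Q-function, the prior, and the values of α and λ are invented for illustration and do not come from the paper.

```python
import math


def normalise(weights):
    total = sum(weights)
    return [w / total for w in weights]


def composite_target(q_values, prior, alpha, lam):
    """Target proportional to prior^(lam/(alpha+lam)) * exp(Q/(alpha+lam)) on a grid."""
    scale = alpha + lam
    q_max = max(q_values)  # subtract the max for numerical stability
    weights = [
        (p ** (lam / scale)) * math.exp((q - q_max) / scale)
        for q, p in zip(q_values, prior)
    ]
    return normalise(weights)


def count_local_maxima(probs):
    return sum(
        1 for i in range(1, len(probs) - 1)
        if probs[i] > probs[i - 1] and probs[i] > probs[i + 1]
    )


# Discretised 1-D action grid on [-2, 2]; Q has two equally good actions at -1 and +1.
grid = [-2.0 + 4.0 * i / 200 for i in range(201)]
q = [-((a * a - 1.0) ** 2) for a in grid]
# Previous policy: a broad Gaussian centred at 0 (a hypothetical choice).
prior = normalise([math.exp(-0.5 * (a / 1.5) ** 2) for a in grid])

target = composite_target(q, prior, alpha=0.1, lam=0.1)
print("local maxima in target:", count_local_maxima(target))  # 2
```

With these toy settings the target keeps two peaks, near a = −1 and a = +1, even though the previous policy is unimodal. A single Gaussian fitted to it would have to either sit between the peaks or drop one of them.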

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

SMFP gives a practical one-step generative policy for off-policy mirror descent with MuJoCo gains, but the entropy surrogate's bias under the multimodal target is the part that still needs checking.

Colleague, this paper introduces Stochastic MeanFlow Policies as a way to get multimodal action distributions in one forward pass for off-policy RL with mirror descent. The main pitch is that it gives a tractable entropy estimate so you can regularize for exploration while using MD to stabilize updates against the previous policy.

They handle the problem setup cleanly. Standard Gaussians can't capture multiple modes well, and full generative models often need iterative sampling or don't give easy entropy. The MeanFlow reparameterization looks like a direct response to that. Training under the combined objective is a reasonable move, and the results on seven MuJoCo environments show better performance than the Gaussian and generative baselines they compare to.

The experiments are the part that carries the most weight here. If the improvements hold with the right ablations and error bars, it's a practical advance for people who need expressive policies without slowing down inference.

The main question mark is the entropy surrogate. The concern about bias in the approximation when the target distribution gets multimodal is worth checking. If the surrogate doesn't track the true entropy closely enough under the MD constraint, the learning could drift from the intended behavior. The paper would be stronger with some analysis or empirical check on how good the approximation actually is.

This is the kind of work that fits in a reading group on modern RL algorithms for continuous control. It has a clear motivation and some evidence it works, so a reader in that area could pick up a useful trick or two. I'd recommend sending it out for peer review. The contribution is focused enough that referees can give targeted feedback on the surrogate and the experiments.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Stochastic MeanFlow Policies (SMFP), a one-step generative policy class that maps Gaussian noise to actions via a MeanFlow transformation. This stochastic reparameterization is claimed to produce a tractable entropy surrogate, enabling training of the policies within an off-policy mirror descent framework under a unified objective that combines entropy regularization (for exploration) with the mirror descent constraint (for stability). The paper asserts that this yields exploratory yet stable improvement and reports empirical gains over Gaussian and generative baselines across seven MuJoCo benchmarks while preserving single-step inference efficiency.

Significance. If the entropy surrogate is shown to be sufficiently accurate and unbiased under the multimodal target induced by the combined objective, and if the empirical gains are reproducible with proper controls, the work could provide a practical bridge between expressive generative policies and the tractability requirements of off-policy RL. It would address a recurring tension in continuous control by allowing multimodal action distributions without iterative sampling or loss of entropy estimates.

major comments (2)

Abstract: the central claim that the stochastic reparameterization 'yields a tractable entropy surrogate' supporting stable off-policy mirror descent is load-bearing, yet the abstract supplies no explicit form of the surrogate, derivation, or error bound relative to the true entropy on the multimodal target created by entropy regularization plus the MD constraint; without this, it is impossible to assess whether bias in the surrogate could shift the fixed point of the unified objective away from the intended one.
Empirical evaluation (referenced in abstract): the reported improvements on seven MuJoCo benchmarks are presented without error bars, ablation studies isolating the entropy surrogate, or controls for the MeanFlow transformation itself; this makes it difficult to attribute gains specifically to the proposed method rather than implementation details or baseline tuning.

minor comments (1)

Abstract: the description of the MeanFlow transformation could be expanded with one additional sentence defining the map to improve accessibility for readers outside the immediate subfield.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.

Referee: Abstract: the central claim that the stochastic reparameterization 'yields a tractable entropy surrogate' supporting stable off-policy mirror descent is load-bearing, yet the abstract supplies no explicit form of the surrogate, derivation, or error bound relative to the true entropy on the multimodal target created by entropy regularization plus the MD constraint; without this, it is impossible to assess whether bias in the surrogate could shift the fixed point of the unified objective away from the intended one.

Authors: The derivation of the entropy surrogate appears in Section 3.2, where the MeanFlow transformation is applied to standard Gaussian noise and the change-of-variables formula yields an exact, closed-form entropy for the resulting policy. Because the surrogate matches the entropy of the policy class exactly, it does not introduce bias that would alter the fixed point of the combined objective; the mirror-descent constraint is enforced on the policy parameters independently of the entropy term. We agree the abstract is overly terse on this point and will revise it to state the surrogate form and point to the derivation. revision: yes
Referee: Empirical evaluation (referenced in abstract): the reported improvements on seven MuJoCo benchmarks are presented without error bars, ablation studies isolating the entropy surrogate, or controls for the MeanFlow transformation itself; this makes it difficult to attribute gains specifically to the proposed method rather than implementation details or baseline tuning.

Authors: We accept that the current presentation lacks sufficient controls. In the revision we will report mean performance with standard-deviation error bars over ten independent random seeds, add an ablation that removes the entropy-surrogate term while keeping the MeanFlow policy class, and include a control that applies the MeanFlow transform to a standard Gaussian policy. These additions will allow clearer attribution of the observed gains. revision: yes
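The seed-level reporting promised in this response amounts to a mean and a sample standard deviation over independent runs. A minimal sketch follows; the function name is hypothetical and the ten return values are placeholders invented for illustration, not results from the paper.

```python
import statistics


def summarise_runs(final_returns):
    """Mean and sample standard deviation of final returns across independent seeds."""
    mean = statistics.mean(final_returns)
    std = statistics.stdev(final_returns)  # sample std (n - 1 denominator)
    return mean, std


# Placeholder numbers for ten seeds, invented for illustration; not results from the paper.
returns_by_seed = [100.0, 102.0, 98.0, 101.0, 99.0, 103.0, 97.0, 100.0, 104.0, 96.0]
mean, std = summarise_runs(returns_by_seed)
print(f"{mean:.1f} ± {std:.1f} over {len(returns_by_seed)} seeds")  # 100.0 ± 2.6 over 10 seeds
```

Standard deviation describes the spread between runs; a standard error or a bootstrap confidence interval would be the more direct choice when the goal is to compare two methods.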

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper introduces Stochastic MeanFlow Policies as a new one-step generative policy class whose stochastic reparameterization is defined to produce a tractable entropy surrogate, which is then used inside an off-policy mirror-descent objective. This construction is presented directly from the policy definition and the choice of MeanFlow map rather than by fitting a parameter to the target quantity or by reducing to a prior self-citation. The central performance claims are supported by external MuJoCo benchmark comparisons rather than by any internal renaming or self-referential fitting. No load-bearing equation or step in the abstract or described derivation reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the MeanFlow transformation itself may hide fitted components or modeling assumptions not visible here.

pith-pipeline@v0.9.0 · 5760 in / 1181 out tokens · 40341 ms · 2026-05-22T09:48:58.798269+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: This stochastic reparameterisation yields a tractable entropy surrogate ... hinge-style entropy regulariser ... advantage-weighted MeanFlow regression objective

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Paper passage: unified objective for exploratory yet stable improvement ... off-policy mirror descent

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 7 internal anchors

[1] Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. In International Conference on Learning Representations (ICLR), 2018.
[2] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
[3] Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning (ICML), 2017.
[4] Aditya Bhatt, Daniel Palenicek, Boris Belousov, Max Argus, Artemij Amiranashvili, Thomas Brox, and Jan Peters. Crossq: Batch normalization in deep reinforcement learning for greater sample efficiency and simplicity. In International Conference on Learning Representations (ICLR), 2024.
[5] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
[6] Onur Celik, Zechu Li, Denis Blessing, Ge Li, Daniel Palenicek, Jan Peters, Georgia Chalvatzaki, and Gerhard Neumann. Dime: Diffusion-based maximum entropy reinforcement learning. In International Conference on Machine Learning (ICML), 2025.
[7] Chang Chen, Fei Deng, Kenji Kawaguchi, Caglar Gulcehre, and Sungjin Ahn. Simple hierarchical planning with diffusion. In The Twelfth International Conference on Learning Representations (ICLR), 2024.
[8] Tianyi Chen, Haitong Ma, Na Li, Kai Wang, and Bo Dai. One-step flow policy mirror descent. arXiv preprint arXiv:2507.23675, 2025.
[9] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 2023.
[10] Cédric Colas, Olivier Sigaud, and Pierre-Yves Oudeyer. How many random seeds? Statistical power analysis in deep reinforcement learning experiments. arXiv preprint arXiv:1806.08295, 2018.
[11] Stephen Dankwa and Wenfeng Zheng. Twin-delayed ddpg: A deep reinforcement learning technique to model a continuous movement of an intelligent robot agent. In Proceedings of the 3rd International Conference on Vision, Image and Signal Processing, 2019.
[12] Shutong Ding, Ke Hu, Zhenhao Zhang, Kan Ren, Weinan Zhang, Jingyi Yu, Jingya Wang, and Ye Shi. Diffusion-based reinforcement learning via q-weighted variational policy optimization. Advances in Neural Information Processing Systems (NeurIPS), 2024.
[13] Zihan Ding and Chi Jin. Consistency models as a rich and efficient policy class for reinforcement learning. In International Conference on Learning Representations (ICLR), 2024.
[14] Xiaoyi Dong, Jian Cheng, and Xi Sheryl Zhang. Maximum entropy reinforcement learning with diffusion policy. In International Conference on Machine Learning (ICML), 2025.
[15] Nicolas Espinosa-Dice, Yiyi Zhang, Yiding Chen, Bradley Guo, Owen Oertell, Gokul Swamy, Kianté Brantley, and Wen Sun. Scaling offline rl via efficient and expressive shortcut models. In Advances in Neural Information Processing Systems (NeurIPS), 2025.
[16] Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. Advances in Neural Information Processing Systems (NeurIPS), 2025.
[17] Will Grathwohl, Ricky TQ Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. Ffjord: Free-form continuous dynamics for scalable reversible generative models. In International Conference on Learning Representations (ICLR), 2019.
[18] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017.
[19] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning (ICML), 2017.
[20] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning (ICML), 2018.
[21] Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. IDQL: Implicit q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573, 2023.
[22] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems (NeurIPS), 2020.
[23] Michael F Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 18(3):1059–1076, 1989.
[24] Sergey Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. Advances in Neural Information Processing Systems (NeurIPS), 2017.
[25] Haque Ishfaq, Guangyuan Wang, Sami Nur Islam, and Doina Precup. Langevin soft actor-critic: Efficient exploration through uncertainty-driven critic learning. In International Conference on Learning Representations (ICLR), 2025.
[26] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[27] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations (ICLR), 2014.
[28] Sergey Levine. Reinforcement learning and control as probabilistic inference: tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
[29] Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. arXiv preprint arXiv:2511.13720, 2025.
[30] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In International Conference on Learning Representations (ICLR), 2023.
[31] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In International Conference on Learning Representations (ICLR), 2023.
[32] Lei Lv, Yunfei Li, Yu Luo, Fuchun Sun, Tao Kong, Jiafeng Xu, and Xiao Ma. Flow-based policy for online reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 2025.
[33] Lei Lv, Yunfei Li, Yu Luo, Fuchun Sun, and Xiao Ma. Flac: Maximum entropy rl via kinetic energy regularized bridge matching. arXiv preprint arXiv:2602.12829, 2026.
[34] Haitong Ma, Tianyi Chen, Kai Wang, Na Li, and Bo Dai. Efficient online reinforcement learning for diffusion policy. In Forty-second International Conference on Machine Learning (ICML), 2025.
[35] Gabriel B Margolis and Pulkit Agrawal. Walk these ways: Tuning robot control for generalization with multiplicity of behavior. In Annual Conference on Robot Learning. PMLR, 2023.
[36] Oier Mees, Dibya Ghosh, Karl Pertsch, Kevin Black, Homer Rich Walke, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, et al. Octo: An open-source generalist robot policy. In First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA, 2024.
[37] Michal Nauman, Mateusz Ostaszewski, Krzysztof Jankowski, Piotr Miłoś, and Marek Cygan. Bigger, regularized, optimistic: scaling for compute and sample efficient continuous control. Advances in Neural Information Processing Systems (NeurIPS), 2024.
[38] Gergely Neu, Anders Jonsson, and Vicenç Gómez. A unified view of entropy-regularized markov decision processes. International Conference on Machine Learning (ICML), 2019.
[39] Samuel Neumann, Jiamin He, Adam White, and Martha White. Investigating the utility of mirror descent in off-policy actor-critic. In Reinforcement Learning Conference (RLC), 2025.
[40] Chaoyi Pan, Giri Anantharaman, Nai-Chieh Huang, Claire Jin, Daniel Pfrommer, Chenyang Yuan, Frank Permenter, Guannan Qu, Nicholas Boffi, Guanya Shi, et al. Much ado about noising: Dispelling the myths of generative robotic control. In International Conference on Learning Representations (ICLR), 2026.
[41] Seohong Park, Qiyang Li, and Sergey Levine. Flow q-learning. In International Conference on Machine Learning (ICML), 2025.
[42] Andrew Patterson, Samuel Neumann, Martha White, and Adam White. Empirical design in reinforcement learning. Journal of Machine Learning Research (JMLR), 2024.
[43] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4195–4205, 2023.
[44] Michael Psenka, Alejandro Escontrela, Pieter Abbeel, and Yi Ma. Learning a diffusion model policy from rewards via q-score matching. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.
[45] Garvesh Raskutti and Sayan Mukherjee. The information geometry of mirror descent. IEEE Transactions on Information Theory, 61(3):1451–1457, 2015.
[46] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning (ICML), 2014.
[47] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897. PMLR, 2015.
[48] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[49] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations (ICLR), 2021.
[50] Chen Tang, Ben Abbatematteo, Jiaheng Hu, Rohan Chandra, Roberto Martín-Martín, and Peter Stone. Deep reinforcement learning for robotics: A survey of real-world successes. Annual Review of Control, Robotics, and Autonomous Systems, 2025.
[51]

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012
[52]

Manan Tomar, Lior Shani, Yonathan Efroni, and Mohammad Ghavamzadeh. Mirror descent policy optimization. In International Conference on Learning Representations (ICLR), 2022
[53]

Nino Vieillard, Tadashi Kozuno, Bruno Scherrer, Olivier Pietquin, Rémi Munos, and Matthieu Geist. Leverage the average: an analysis of kl regularization in reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 2020
[54]

Qing Wang, Yingru Li, Jiechao Xiong, and Tong Zhang. Divergence-augmented policy optimization. Advances in Neural Information Processing Systems (NeurIPS), 2019
[55]

Yinuo Wang, Likun Wang, Yuxuan Jiang, Wenjun Zou, Tong Liu, Xujie Song, Wenxuan Wang, Liming Xiao, Jiang Wu, Jingliang Duan, et al. Diffusion actor-critic with entropy regulator. Advances in Neural Information Processing Systems (NeurIPS), 2024
[56]

Zeyuan Wang, Da Li, Yulin Chen, Ye Shi, Liang Bai, Tianyuan Yu, and Yanwei Fu. One-step generative policies with q-learning: A reformulation of meanflow. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2026
[57]

Long Yang, Zhixiong Huang, Fenghao Lei, Yucun Zhong, Yiming Yang, Cong Fang, Shiting Wen, Binbin Zhou, and Zhouchen Lin. Policy representation via diffusion probability model for reinforcement learning. arXiv preprint arXiv:2305.13122, 2023
[58]

Tianyuan Yu, Yongxin Yang, Da Li, Timothy Hospedales, and Tao Xiang. Simple and effective stochastic neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2021
[59]

Tonghe Zhang, Chao Yu, Sichang Su, and Yu Wang. Reinflow: Fine-tuning flow matching policy with online reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 2025
[60]

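$\dots \sum_{i=1}^{d} \log \sigma_\theta^{(i)}(a_t, b, t) \Big] + \underbrace{\tfrac{d}{2}\log(2\pi e)}_{\text{const.}} \;\Rightarrow\; \tilde{H}(\pi_\theta \mid e) = \mathbb{E}_{e} \dots$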

Yixian Zhang, Shu’ang Yu, Tonghe Zhang, Mo Guang, Haojia Hui, Kaiwen Long, Yu Wang, Chao Yu, and Wenbo Ding. Sac flow: Sample-efficient reinforcement learning of flow-based policies via velocity-reparameterized sequential modeling. International Conference on Learning Representations (ICLR), 2026

A Limitations and Future Work

Although the hybrid optim...
[61]

Monotonicity. Higher advantage samples receive strictly higher weights, preserving the rank-based preference for high-value regions. The rectification ensures that sub-optimal actions (where A < 0) are filtered out, focusing the generative modelling solely on the improving regions of the action space
[62]

Numerical Stability and Implicit Gradient Clipping. The exponential function is highly sensitive to the scale of Q-values. In practice, unbounded Q-values can cause exp(A/λ) to explode, resulting in numerical overflow and unstable gradients. The truncated linear approximation (Q−V)_+ naturally bounds the weights and acts as an implicit gradient clipper. Thi...
[63]

Scale Invariance and Hyperparameter Robustness. In standard exponential weighting exp(A/λ), the regularisation coefficient λ must be carefully tuned for each environment, as the scale of return (and thus the Q-values) varies significantly across different tasks. A fixed λ can lead to vanishing gradients in high-reward environments or uniform weights in low-...
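The contrast drawn above between exponential weighting exp(A/λ) and the rectified weight (Q−V)_+ can be illustrated with a minimal Python sketch. The function names, the sample advantages, and the choice λ = 1 are illustrative assumptions and are not taken from the paper. The sketch shows only the overflow behaviour and the filtering of negative advantages; it omits any normalisation that the truncated passages may describe.

```python
import math


def safe_exp_weight(advantage: float, lam: float) -> float:
    """Exponential weight exp(A / lambda).

    math.exp raises OverflowError for large arguments; here that case
    is reported as infinity so the failure mode is visible.
    """
    try:
        return math.exp(advantage / lam)
    except OverflowError:
        return float("inf")


def rectified_weight(advantage: float) -> float:
    """Truncated linear weight (Q - V)_+ = max(A, 0).

    Negative advantages are mapped to zero; positive advantages
    grow linearly rather than exponentially.
    """
    return max(advantage, 0.0)


if __name__ == "__main__":
    lam = 1.0  # hypothetical regularisation coefficient
    # Hypothetical advantages A = Q - V on very different scales.
    for adv in (-5.0, 0.5, 20.0, 2000.0):
        print(adv, safe_exp_weight(adv, lam), rectified_weight(adv))
```

With an advantage of 2000 and λ = 1 the exponential weight exceeds the double-precision range, while the rectified weight stays finite; this is the scale sensitivity that the appendix attributes to a fixed λ.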

