
arxiv: 2606.29820 · v1 · pith:MOG7SL6Anew · submitted 2026-06-29 · 💻 cs.LG · cs.AI

Dual-Flow Reinforcement Learning with State-Aware Exploration

Pith reviewed 2026-06-30 07:13 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learning, continuous control, flow matching, multimodal policy, value estimation, exploration, actor-critic

The pith

Dual-Flow RL jointly models multimodal policies and return distributions via conditional flow matching to enable reliable value estimation and sustained exploration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that in continuous-control reinforcement learning, optimal actions are often multimodal while returns are uncertain, so standard unimodal value estimators introduce bias and generative policies tend to collapse. It proposes a single actor-critic architecture that uses conditional flow matching to represent both a full return distribution and a multimodal action distribution at once. An additional Entropy-Covariance Exploration Regulator then modulates exploration on a per-state basis using policy entropy and action covariance. Experiments across DeepMind Control Suite and Humanoid-Bench show this dual-flow design yields state-of-the-art scores on most tasks and materially outperforms earlier diffusion- and flow-based methods.

Core claim

Dual-Flow RL is an actor-critic method that applies conditional flow matching to both a continuous return distribution and a multimodal policy distribution, learning the two simultaneously. An Entropy-Covariance Exploration Regulator adjusts exploration intensity according to each state's policy entropy and action-uncertainty covariance. The resulting framework is claimed to give reliable value estimates and prevent premature mode collapse, delivering superior performance on standard continuous-control benchmarks.

What carries the argument

Conditional flow matching applied in parallel to return and policy distributions, paired with the Entropy-Covariance Exploration Regulator that scales exploration using entropy and covariance per state.
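
As a rough illustration of the mechanism named above, the sketch below shows one common form of conditional flow matching: a straight-line path between a noise sample and a data sample, a regression target equal to their difference, and Euler integration of a velocity field at sampling time. It is not taken from the paper. The linear path, the function names (`cfm_training_pair`, `cfm_loss`, `euler_sample`) and the toy two-dimensional setting are assumptions chosen for readability. In Dual-Flow RL the data sample would be an action for the actor or a return for the critic, with the state as conditioning input.

```python
import random


def cfm_training_pair(x0, x1, t):
    """Return (x_t, target_velocity) for the straight-line path from x0 to x1.

    x0: noise sample, x1: data sample (an action or a return), t in [0, 1].
    Assumption: linear interpolation path; the paper's exact path may differ.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_v = [b - a for a, b in zip(x0, x1)]
    return x_t, target_v


def cfm_loss(pred_v, target_v):
    """Squared error between predicted and target velocity for one sample."""
    return sum((p - q) ** 2 for p, q in zip(pred_v, target_v))


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    noise = [rng.gauss(0.0, 1.0) for _ in range(2)]
    data = [0.5, -1.0]
    x_t, v_star = cfm_training_pair(noise, data, t=0.3)
    # For this single pair, a perfect velocity model is constant in x and t,
    # so Euler integration lands on the data sample for any step count.
    print(euler_sample(lambda x, t: v_star, noise, num_steps=4))
```

The `num_steps` argument plays the role of a flow-step count; a trained network would replace the constant lambda and would take the state as an extra input.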

If this is right

  • Value estimates remain unbiased even when returns are multimodal, removing a systematic source of error in actor-critic updates.
  • Policy distributions retain multiple distinct modes throughout training instead of collapsing, enabling continued exploration of high-return regions.
  • State-dependent regulation of exploration via entropy and covariance allows the agent to explore more in uncertain states and less in well-understood ones.
  • The same conditional-flow architecture can be dropped into existing actor-critic codebases without requiring separate generative models for policy and value.
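
The ECER formula itself is not reproduced on this page, so the following is only a hypothetical sketch of what per-state regulation from entropy and covariance could look like. It takes several actions sampled for one state, summarizes their spread with a per-dimension sample variance and a diagonal-Gaussian entropy proxy, and maps that to an exploration scale that grows with the spread. The function names, the Gaussian proxy, the `reference_entropy` parameter and the tanh scaling rule are all assumptions for illustration, not the authors' method.

```python
import math


def diag_covariance(actions):
    """Per-dimension sample variance of a list of action vectors."""
    n = len(actions)
    dim = len(actions[0])
    means = [sum(a[d] for a in actions) / n for d in range(dim)]
    return [
        sum((a[d] - means[d]) ** 2 for a in actions) / (n - 1)
        for d in range(dim)
    ]


def gaussian_entropy_proxy(variances, eps=1e-8):
    """Entropy (in nats) of a diagonal Gaussian with the given variances."""
    return 0.5 * sum(
        math.log(2.0 * math.pi * math.e * (v + eps)) for v in variances
    )


def exploration_scale(actions, base_scale, reference_entropy):
    """Hypothetical rule: a scale in [0, 2 * base_scale] that is larger
    when the sampled actions for this state are more spread out."""
    h = gaussian_entropy_proxy(diag_covariance(actions))
    return base_scale * (1.0 + math.tanh(h - reference_entropy))


if __name__ == "__main__":
    # Made-up action samples for two states.
    confident_state = [[0.0, 0.0], [0.01, 0.0], [0.0, 0.01], [0.01, 0.01]]
    uncertain_state = [[-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]]
    print(exploration_scale(confident_state, 1.0, reference_entropy=0.0))
    print(exploration_scale(uncertain_state, 1.0, reference_entropy=0.0))
```

A multimodal policy is not Gaussian, so the entropy proxy here is crude; it is used only because it can be computed from a handful of samples.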

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may extend to offline RL settings where multimodal return distributions are even more pronounced.
  • If the regulator proves robust, it could replace hand-tuned entropy bonuses in many current algorithms.
  • Training two coupled flow models increases compute; future work would need to measure whether the performance gain justifies the extra cost on larger state spaces.

Load-bearing premise

Conditional flow matching will keep producing diverse high-value action modes and unbiased return estimates across the tested control tasks without collapsing or drifting.

What would settle it

Run the same tasks with the dual-flow components replaced by standard Gaussian critics and unimodal policies. If performance stays at the same level, the joint-modeling claim is falsified; a drop to levels comparable to prior diffusion or flow methods would instead support it.

Figures

Figures reproduced from arXiv: 2606.29820 by Diange Yang, Kun Jiang, Qijun Li, Qi Song, Weitao Zhou, Yifei He, Zheng Fu.

Figure 1: Comparison of performance and efficiency across
Figure 2: Ablations. (a–b) Evaluation curves on MuJoCo: We compare Dual-Flow with various diffusion and flow baselines on Humanoid-v3 and Ant-v3. Dual-Flow outperforms most baselines and achieves stronger overall performance. (c–d) Exploration-strength regulation: Dual-Flow on Humanoid-run with different exploration regulators, including ECER (ours), DACER-style entropy regulator, and fixed initial noise scale tunin…
Figure 3: Visualizing the return distribution. Columns from left to right show the predicted return distributions of Dual-Flow, C51, IQN, and DSAC, followed by the corresponding 1-Wasserstein distance. Dual-Flow better matches the ground-truth distributions and achieves the lowest error. V. EXPERIMENTS A. Experimental Setup Baselines. Our method is compared and evaluated against 13 model-free continuous-control base…
Figure 4: Evaluation curves on benchmarks. Solid lines denote the mean and shaded regions indicate the 95% confidence interval. IQN [18], and DSAC [9]. For ground truth, we fix the policy and estimate the distribution by performing repeated rollouts. For fair comparison, we use 5000 return samples and 60 bins for each histogram. These results demonstrate that our method provides a more accurate characterization of t…
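
The comparison described in the Figure 3 and Figure 4 captions (repeated rollouts under a fixed policy, 5000 return samples, 60-bin histograms, 1-Wasserstein distance to the ground truth) can be approximated in a few lines. The sketch below is an illustration only. It relies on the standard identity that, for two equal-size one-dimensional samples, the 1-Wasserstein distance equals the mean absolute difference of the sorted values. The helper names and the synthetic returns are assumptions; the paper's evaluation code is not shown here.

```python
import random


def wasserstein_1d(samples_a, samples_b):
    """1-Wasserstein distance between two equal-size 1-D empirical samples."""
    if len(samples_a) != len(samples_b):
        raise ValueError("this sketch assumes equal sample counts")
    a, b = sorted(samples_a), sorted(samples_b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def histogram(samples, num_bins, lo, hi):
    """Counts per bin over [lo, hi]; out-of-range values go to the edge bins."""
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for s in samples:
        idx = int((s - lo) / width)
        counts[min(max(idx, 0), num_bins - 1)] += 1
    return counts


def bimodal_return(rng):
    """Synthetic stand-in for a rollout return with two outcome modes."""
    center = 100.0 if rng.random() < 0.5 else 150.0
    return rng.gauss(center, 5.0)


if __name__ == "__main__":
    rng = random.Random(0)
    truth = [bimodal_return(rng) for _ in range(5000)]
    # A unimodal Gaussian prediction with a similar mean and spread.
    unimodal = [rng.gauss(125.0, 25.5) for _ in range(5000)]
    # A second draw from the same bimodal process, as a well-matched model.
    matched = [bimodal_return(rng) for _ in range(5000)]
    print("unimodal vs truth:", wasserstein_1d(truth, unimodal))
    print("matched vs truth:", wasserstein_1d(truth, matched))
    print(histogram(truth, num_bins=60, lo=60.0, hi=190.0)[:5])
```

Computing the distance on raw samples avoids a dependence on bin width; the histograms are then only needed for plotting.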
Figure 5: Ablation studies. (a)–(b) Distributional critic vs. expected-value critic: Replacing the expected-value critic with our flow distributional critic leads to faster learning and higher final returns on Dog-trot (a) and Humanoid-run (b). (c)–(d) Sensitivity to flow steps on Dog-run: Learning curves (c) and IQM (d) across different M1 and M2 settings show that the method is robust to the choice of flow-step co…
Figure 6: Sensitivity analysis to the base regularization parameter λ0. (a,c) show learning curves on Humanoid-run and Dog-run for different λ0. (b,d) report the corresponding aggregated final performance (IQM return) with 95% stratified bootstrap confidence intervals. Larger λ0 (e.g., 1, 5) restricts exploration and degrades performance, while smaller values (e.g., 0.05, 0.1) behave similarly. Regularization coeffi…
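
The Figure 6 caption reports IQM return with 95% stratified bootstrap confidence intervals. For readers unfamiliar with that aggregate, here is a minimal sketch, assuming a percentile bootstrap and assuming that "stratified" means resampling runs with replacement within each task before pooling. The function names and the scores in the usage example are made up; they are not results from the paper.

```python
import random


def iqm(values):
    """Interquartile mean: mean after trimming 25% of values from each end."""
    ordered = sorted(values)
    k = int(0.25 * len(ordered))
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def stratified_bootstrap_ci(scores_by_task, num_resamples=2000,
                            level=0.95, seed=0):
    """Percentile CI for the pooled IQM, resampling runs within each task."""
    rng = random.Random(seed)
    stats = []
    for _ in range(num_resamples):
        pooled = []
        for runs in scores_by_task.values():
            # Draw as many runs as the task has, with replacement.
            pooled.extend(rng.choice(runs) for _ in runs)
        stats.append(iqm(pooled))
    stats.sort()
    alpha = (1.0 - level) / 2.0
    lo = stats[int(alpha * (num_resamples - 1))]
    hi = stats[int((1.0 - alpha) * (num_resamples - 1))]
    return lo, hi


if __name__ == "__main__":
    # Hypothetical normalized scores, four seeds per task.
    scores = {
        "task_a": [0.61, 0.70, 0.66, 0.58],
        "task_b": [0.42, 0.55, 0.47, 0.51],
    }
    pooled = [s for runs in scores.values() for s in runs]
    print("IQM:", iqm(pooled))
    print("95% CI:", stratified_bootstrap_ci(scores))
```

Resampling within each task keeps every task equally represented in each bootstrap replicate, which is the point of stratifying.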
Figure 7: Visualizations of the considered environments. From the DeepMind Control Suite (DMC) [13], we include ten locomotion tasks: Humanoid (Run/Stand), Dog (Run/Trot/Stand/Walk), Walker (Run/Walk/Stand), and Quadruped (Walk). From H-Bench [14], we include H1-sit hard and H1-balance simple. From MuJoCo Gym [15], we include Humanoid-v3 and Ant-v3. These environments are characterized by high-dimensional state-acti…
Original abstract

In complex continuous-control reinforcement learning tasks, multimodal optimal actions often coincide with uncertain, multimodal return distributions, making reliable value estimation and multimodal exploration challenging. Existing value estimation methods using unimodal Gaussians restrict expressiveness and yield biased estimates. Recent generative policies can represent multimodal actions but often collapse to a few modes and under-explore high-value areas of the action space. Motivated by these challenges, we propose Dual-Flow RL, a unified actor-critic framework that jointly models a continuous return distribution and a multimodal policy distribution using conditional flow matching (CFM). This design supports reliable value estimation and sustained multimodal exploration. To further enhance exploration, we introduce an Entropy-Covariance Exploration Regulator (ECER) that enables state-aware exploration regulation leveraging policy entropy and action-uncertainty covariance. Experiments on DeepMind Control Suite and Humanoid-Bench show that Dual-Flow RL achieves state-of-the-art performance on most tasks, significantly outperforming prior diffusion-based and flow-based methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes Dual-Flow RL, a unified actor-critic framework for continuous-control RL that jointly models a continuous return distribution and a multimodal policy distribution via conditional flow matching (CFM). It augments this with an Entropy-Covariance Exploration Regulator (ECER) that regulates exploration in a state-aware manner using policy entropy and action-uncertainty covariance. Experiments on DeepMind Control Suite and Humanoid-Bench are reported to show state-of-the-art performance on most tasks, outperforming prior diffusion-based and flow-based methods.

Significance. If the empirical results hold under rigorous evaluation, the work offers a coherent integration of CFM into an actor-critic loop that simultaneously targets reliable value estimation for multimodal returns and sustained multimodal exploration, addressing documented limitations of unimodal critics and mode-collapsing generative policies. The ECER component provides an explicit mechanism for state-dependent regulation that could be reusable beyond this architecture.

major comments (1)
  1. [Experiments] The central empirical claim (SOTA on most tasks) is load-bearing for the contribution, yet the provided text supplies no visible details on experimental protocol, number of random seeds, statistical tests, baseline implementations, or ablation results that would allow verification that post-hoc hyperparameter choices or implementation details do not drive the reported gains.
minor comments (2)
  1. [Method] Notation for the dual CFM objectives and the precise conditioning variables in the joint actor-critic update could be clarified with an explicit equation block early in the method section to aid reproducibility.
  2. [Method] The abstract states that ECER 'enables state-aware exploration regulation,' but a short illustrative example or pseudocode showing how the covariance term modulates the entropy bonus per state would improve clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We appreciate the recognition of the potential significance of integrating conditional flow matching into an actor-critic framework with state-aware exploration. We address the major comment below.

Point-by-point responses
  1. Referee: [Experiments] The central empirical claim (SOTA on most tasks) is load-bearing for the contribution, yet the provided text supplies no visible details on experimental protocol, number of random seeds, statistical tests, baseline implementations, or ablation results that would allow verification that post-hoc hyperparameter choices or implementation details do not drive the reported gains.

    Authors: We agree that the manuscript as provided does not include sufficient explicit details on the experimental protocol, which limits independent verification of the SOTA claims. In the revised version we will add a dedicated experimental protocol subsection (and expanded appendix) specifying: the number of random seeds (10 seeds per task), reporting conventions (mean and standard deviation across seeds), statistical comparisons (paired t-tests against baselines with p<0.05 thresholds), baseline implementation details (original code repositories, any re-implementations or hyperparameter selections, and training budgets), and full ablation results isolating the contribution of the dual-flow architecture versus ECER. These additions will be placed in the main body where space permits and otherwise in the appendix, directly addressing concerns about post-hoc choices. revision: yes
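
The reporting conventions promised in this response (10 seeds per task, mean and standard deviation across seeds, paired t-tests at p<0.05) could look like the following sketch. It uses only the Python standard library, so instead of computing a p-value it compares the t statistic with 2.262, the two-sided 5% critical value of Student's t for 9 degrees of freedom. The per-seed returns are invented placeholders, and pairing runs by seed index is an assumption.

```python
import math
import statistics

# Student's t critical value for df = 9, two-sided alpha = 0.05.
T_CRIT_DF9_TWO_SIDED_05 = 2.262


def summarize(returns):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(returns), statistics.stdev(returns)


def paired_t_statistic(ours, baseline):
    """t statistic for paired samples (same task, runs matched by seed)."""
    diffs = [a - b for a, b in zip(ours, baseline)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


if __name__ == "__main__":
    # Placeholder final returns for 10 seeds; not results from the paper.
    ours = [812.0, 790.0, 805.0, 830.0, 798.0,
            820.0, 809.0, 815.0, 801.0, 825.0]
    base = [780.0, 775.0, 799.0, 801.0, 770.0,
            790.0, 795.0, 788.0, 782.0, 800.0]
    mean, std = summarize(ours)
    t = paired_t_statistic(ours, base)
    significant = abs(t) > T_CRIT_DF9_TWO_SIDED_05
    print(f"ours: {mean:.1f} +/- {std:.1f}, t = {t:.2f}, "
          f"significant at 0.05: {significant}")
```

The critical value is tied to 10 paired seeds; a different seed count needs a different threshold, and comparisons against many baselines would call for a multiple-comparison correction.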

Circularity Check

0 steps flagged

No significant circularity; method proposal lacks derivations or self-referential reductions

full rationale

The provided abstract and description contain no equations, derivations, or explicit prediction steps. The paper proposes Dual-Flow RL as a joint CFM-based actor-critic plus ECER regulator, motivated by challenges in multimodal returns and exploration, then reports empirical results on standard benchmarks. No self-definitional constructions, fitted inputs renamed as predictions, or load-bearing self-citations appear. The approach builds on existing conditional flow matching without reducing any claimed result to its own inputs by construction. This is the expected non-finding for a high-level method proposal without mathematical derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are specified or derivable from the provided text.



Reference graph

Works this paper leans on

46 extracted references · 13 canonical work pages · 6 internal anchors

  1. [1]

    Scalable exploration for high-dimensional continuous control via value-guided flow,

    Y. Wei, C. Zuo, and Y. Sui, “Scalable exploration for high-dimensional continuous control via value-guided flow,” arXiv preprint arXiv:2601.19707, 2026

  2. [2]

    Global–local decomposition of contextual representations in meta-reinforcement learning,

    N. Ma, J. Xuan, G. Zhang, and J. Lu, “Global–local decomposition of contextual representations in meta-reinforcement learning,” IEEE Transactions on Cybernetics, vol. 55, no. 3, pp. 1277–1287, 2025

  3. [3]

    Ekg-ac: A new paradigm for process industrial optimization based on offline reinforcement learning with expert knowledge guidance,

    D. Liu, Y. Wang, C. Liu, B. Luo, and B. Huang, “Ekg-ac: A new paradigm for process industrial optimization based on offline reinforcement learning with expert knowledge guidance,” IEEE Transactions on Cybernetics, pp. 1–11, 2025

  4. [4]

    Optimal tracking control of uncertain nonlinear systems using simplified reinforcement learning,

    P. Ning, L. Duan, and C. Hua, “Optimal tracking control of uncertain nonlinear systems using simplified reinforcement learning,” IEEE Transactions on Cybernetics, vol. 56, no. 6, pp. 3200–3209, 2026

  5. [5]

    Exploring the application of blockchain technology in crowdsource autonomous driving map updating,

    B. Wijaya, M. Yang, K. Jiang et al., “Exploring the application of blockchain technology in crowdsource autonomous driving map updating,” Communications in Transportation Research, vol. 4, p. 100140, 2024

  6. [6]

    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,

    T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International conference on machine learning. PMLR, 2018, pp. 1861–1870

  7. [7]

    Addressing function approximation error in actor-critic methods,

    S. Fujimoto, H. Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in International conference on machine learning. PMLR, 2018, pp. 1587–1596

  8. [8]

    Conservative q-learning for offline reinforcement learning,

    A. Kumar, A. Zhou, G. Tucker, and S. Levine, “Conservative q-learning for offline reinforcement learning,” Advances in neural information processing systems, vol. 33, pp. 1179–1191, 2020

  9. [9]

    Distributional soft actor-critic: Off-policy reinforcement learning for addressing value estimation errors,

    J. Duan, Y. Guan, S. E. Li, Y. Ren, Q. Sun, and B. Cheng, “Distributional soft actor-critic: Off-policy reinforcement learning for addressing value estimation errors,” IEEE transactions on neural networks and learning systems, vol. 33, no. 11, pp. 6584–6598, 2021

  10. [10]

    Distributional reinforcement learning with quantile regression,

    W. Dabney, M. Rowland, M. Bellemare, and R. Munos, “Distributional reinforcement learning with quantile regression,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018

  11. [11]

    Flowpg: action-constrained policy gradient with normalizing flows,

    J. Brahmanage, J. Ling, and A. Kumar, “Flowpg: action-constrained policy gradient with normalizing flows,” Advances in Neural Information Processing Systems, vol. 36, pp. 20118–20132, 2023

  12. [12]

    Diffusion actor-critic with entropy regulator,

    Y. Wang, L. Wang, Y. Jiang, W. Zou, T. Liu, X. Song, W. Wang, L. Xiao, J. Wu, J. Duan et al., “Diffusion actor-critic with entropy regulator,” Advances in Neural Information Processing Systems, vol. 37, pp. 54183–54204, 2024

  13. [13]

    DeepMind Control Suite

    Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq et al., “Deepmind control suite,” arXiv preprint arXiv:1801.00690, 2018

  14. [14]

    Humanoid-bench: Simulated humanoid benchmark for whole-body locomotion and manipulation,

    C. Sferrazza, D.-M. Huang, X. Lin, Y. Lee, and P. Abbeel, “Humanoid-bench: Simulated humanoid benchmark for whole-body locomotion and manipulation,” in Robotics: Science and Systems, 2024

  15. [15]

    OpenAI Gym

    G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016

  16. [16]

    Flow-based policy for online reinforcement learning,

    L. Lv, Y. Li, Y. Luo, F. Sun, T. Kong, J. Xu, and X. Ma, “Flow-based policy for online reinforcement learning,” arXiv preprint arXiv:2506.12811, 2025

  17. [17]

    A distributional perspective on reinforcement learning,

    M. G. Bellemare, W. Dabney, and R. Munos, “A distributional perspective on reinforcement learning,” in International conference on machine learning. PMLR, 2017, pp. 449–458

  18. [18]

    Implicit quantile networks for distributional reinforcement learning,

    W. Dabney, G. Ostrovski, D. Silver, and R. Munos, “Implicit quantile networks for distributional reinforcement learning,” in International conference on machine learning. PMLR, 2018, pp. 1096–1105

  19. [19]

    Distributed distributional deterministic policy gradients,

    G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, A. Muldal, N. Heess, and T. Lillicrap, “Distributed distributional deterministic policy gradients,” in International Conference on Learning Representations, 2018

  20. [20]

    Conservative offline distributional reinforcement learning,

    Y. Ma, D. Jayaraman, and O. Bastani, “Conservative offline distributional reinforcement learning,” Advances in neural information processing systems, vol. 34, pp. 19235–19247, 2021

  21. [21]

    Bellman diffusion: Generative modeling as learning a linear operator in the distribution space,

    Y. Li, C.-H. Lai, C.-B. Schönlieb, Y. Mitsufuji, and S. Ermon, “Bellman diffusion: Generative modeling as learning a linear operator in the distribution space,” arXiv preprint arXiv:2410.01796, 2024

  22. [22]

    Parameter-Efficient Distributional RL via Normalizing Flows and a Geometry-Aware Cramér Surrogate

    R. Kaddah, J. Read, M.-P. Cani et al., “Flow models for unbounded and geometry-aware distributional reinforcement learning,” arXiv preprint arXiv:2505.04310, 2025

  23. [23]

    Off-policy deep reinforcement learning without exploration,

    S. Fujimoto, D. Meger, and D. Precup, “Off-policy deep reinforcement learning without exploration,” in International conference on machine learning. PMLR, 2019, pp. 2052–2062

  24. [24]

    Offline reinforcement learning with implicit q-learning,

    I. Kostrikov, A. Nair, and S. Levine, “Offline reinforcement learning with implicit q-learning,” in International Conference on Learning Representations, 2022

  25. [25]

    Generative adversarial imitation learning,

    J. Ho and S. Ermon, “Generative adversarial imitation learning,” Advances in neural information processing systems, vol. 29, 2016

  26. [26]

    Diffusion-based reinforcement learning via q-weighted variational policy optimization,

    S. Ding, K. Hu, Z. Zhang, K. Ren, W. Zhang, J. Yu, J. Wang, and Y. Shi, “Diffusion-based reinforcement learning via q-weighted variational policy optimization,” Advances in Neural Information Processing Systems, vol. 37, pp. 53945–53968, 2024

  27. [27]

    Energy-weighted flow matching for offline reinforcement learning,

    S. Zhang, W. Zhang, and Q. Gu, “Energy-weighted flow matching for offline reinforcement learning,” arXiv preprint arXiv:2503.04975, 2025

  28. [28]

    Diffusion policies as an expressive policy class for offline reinforcement learning,

    Z. Wang, J. J. Hunt, and M. Zhou, “Diffusion policies as an expressive policy class for offline reinforcement learning,” in International Conference on Learning Representations, 2023

  29. [29]

    Learning a diffusion model policy from rewards via q-score matching,

    M. Psenka, A. Escontrela, P. Abbeel, and Y. Ma, “Learning a diffusion model policy from rewards via q-score matching,” in International Conference on Machine Learning, 2024

  30. [30]

    Diffcps: Diffusion model based constrained policy search for offline reinforcement learning,

    L. He, L. Zhang, J. Tan, and X. Wang, “Diffcps: Diffusion model based constrained policy search for offline reinforcement learning,” in International Conference on Learning Representations (ICLR), 2024

  31. [31]

    Flow q-learning,

    S. Park, Q. Li, and S. Levine, “Flow q-learning,” in International Conference on Machine Learning, 2025

  32. [32]

    Reinflow: Fine-tuning flow matching policy with online reinforcement learning,

    T. Zhang, C. Yu, S. Su, and Y. Wang, “Reinflow: Fine-tuning flow matching policy with online reinforcement learning,” arXiv preprint arXiv:2505.22094, 2025

  33. [33]

    Sac flow: Sample-efficient reinforcement learning of flow-based policies via velocity-reparameterized sequential modeling,

    Y. Zhang, S. Yu, T. Zhang, M. Guang, H. Hui, K. Long, Y. Wang, C. Yu, and W. Ding, “Sac flow: Sample-efficient reinforcement learning of flow-based policies via velocity-reparameterized sequential modeling,” arXiv preprint arXiv:2509.25756, 2025

  34. [34]

    Flow-GRPO: Training Flow Matching Models via Online RL

    J. Liu, G. Liu, J. Liang, Y. Li, J. Liu, X. Wang, P. Wan, D. Zhang, and W. Ouyang, “Flow-grpo: Training flow matching models via online rl,” arXiv preprint arXiv:2505.05470, 2025

  35. [35]

    DanceGRPO: Unleashing GRPO on Visual Generation

    Z. Xue, J. Wu, Y. Gao, F. Kong, L. Zhu, M. Chen, Z. Liu, W. Liu, Q. Guo, W. Huang et al., “Dancegrpo: Unleashing grpo on visual generation,” arXiv preprint arXiv:2505.07818, 2025

  36. [36]

    MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

    J. Li, Y. Cui, T. Huang, Y. Ma, C. Fan, M. Yang, and Z. Zhong, “Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde,” arXiv preprint arXiv:2507.21802, 2025

  37. [37]

    Smooth exploration for robotic reinforcement learning,

    A. Raffin and F. Stulp, “Smooth exploration for robotic reinforcement learning,” in Conference on Robot Learning. PMLR, 2021, pp. 1634–1644

  38. [38]

    Latent exploration for reinforcement learning,

    A. S. Chiappa, A. Marin Vargas, A. Huang, and A. Mathis, “Latent exploration for reinforcement learning,” Advances in Neural Information Processing Systems, vol. 36, pp. 56508–56530, 2023

  39. [39]

    Colored noise in ppo: improved exploration and performance through correlated action sampling,

    J. Hollenstein, G. Martius, and J. Piater, “Colored noise in ppo: improved exploration and performance through correlated action sampling,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 11, 2024, pp. 12466–12472

  40. [40]

    On entropy approximation for gaussian mixture random vectors,

    M. F. Huber, T. Bailey, H. Durrant-Whyte, and U. D. Hanebeck, “On entropy approximation for gaussian mixture random vectors,” in 2008 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems. IEEE, 2008, pp. 181–188

  41. [41]

    Maximum a posteriori policy optimisation,

    A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. Riedmiller, “Maximum a posteriori policy optimisation,” in International Conference on Learning Representations, 2018

  42. [42]

    Asynchronous methods for deep reinforcement learning,

    V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning. PMLR, 2016, pp. 1928–1937

  43. [43]

    Trust region policy optimization,

    J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International conference on machine learning. PMLR, 2015, pp. 1889–1897

  44. [44]

    Continuous control with deep reinforcement learning,

    T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in International Conference on Learning Representations, 2016

  45. [45]

    Bigger, regularized, optimistic: scaling for compute and sample efficient continuous control,

    M. Nauman, M. Ostaszewski, K. Jankowski, P. Miłoś, and M. Cygan, “Bigger, regularized, optimistic: scaling for compute and sample efficient continuous control,” Advances in neural information processing systems, vol. 37, pp. 113038–113071, 2024

  46. [46]

    Sampling from energy-based policies using diffusion,

    V. Jain, T. Akhound-Sadegh, and S. Ravanbakhsh, “Sampling from energy-based policies using diffusion,” arXiv preprint arXiv:2410.01312, 2024