Dual-Flow Reinforcement Learning with State-Aware Exploration

Diange Yang; Kun Jiang; Qijun Li; Qi Song; Weitao Zhou; Yifei He; Zheng Fu

arxiv: 2606.29820 · v1 · pith:MOG7SL6Anew · submitted 2026-06-29 · 💻 cs.LG · cs.AI

Qijun Li, Zheng Fu, Qi Song, Yifei He, Weitao Zhou, Kun Jiang, Diange Yang

Pith reviewed 2026-06-30 07:13 UTC · model grok-4.3

keywords: reinforcement learning, continuous control, flow matching, multimodal policy, value estimation, exploration, actor-critic

The pith

Dual-Flow RL jointly models multimodal policies and return distributions via conditional flow matching to enable reliable value estimation and sustained exploration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that in continuous-control reinforcement learning, optimal actions are often multimodal while returns are uncertain, so standard unimodal value estimators introduce bias and generative policies tend to collapse. It proposes a single actor-critic architecture that uses conditional flow matching to represent both a full return distribution and a multimodal action distribution at once. An additional Entropy-Covariance Exploration Regulator then modulates exploration on a per-state basis using policy entropy and action covariance. Experiments across DeepMind Control Suite and Humanoid-Bench show this dual-flow design yields state-of-the-art scores on most tasks and materially outperforms earlier diffusion- and flow-based methods.

Core claim

Dual-Flow RL is an actor-critic method that simultaneously learns a continuous return distribution and a multimodal policy distribution by applying conditional flow matching to both, while an Entropy-Covariance Exploration Regulator adjusts exploration intensity according to each state's policy entropy and action-uncertainty covariance. The framework is claimed to give more reliable value estimates than unimodal Gaussian critics and to avoid premature mode collapse, and it is reported to reach state-of-the-art performance on most of the tested continuous-control tasks.

What carries the argument

Conditional flow matching applied in parallel to return and policy distributions, paired with the Entropy-Covariance Exploration Regulator that scales exploration using entropy and covariance per state.
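As a rough illustration of the conditional flow matching ingredient, the sketch below uses the common linear interpolation path between a noise sample and a data sample. The paper's exact probability path, conditioning variables, and networks are not given in this summary, so the function names (`cfm_training_pair`, `cfm_loss`, `euler_sample`, `point_mass_velocity`) and the point-mass toy target are assumptions made purely for illustration.

```python
import random


def cfm_training_pair(x0, x1, t):
    """Return (x_t, target_velocity) for the linear path x_t = (1 - t) * x0 + t * x1.

    A flow network v(x_t, t | condition) would be regressed onto target_velocity;
    the condition (a state, or a state-action pair) is omitted in this toy version.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, target_velocity


def cfm_loss(predicted_velocity, target_velocity):
    """Squared-error flow matching loss for a single sample."""
    return sum((p - u) ** 2 for p, u in zip(predicted_velocity, target_velocity))


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so 1 - t never reaches zero
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def point_mass_velocity(target):
    """Exact conditional velocity field when all data sits at `target` (toy case)."""
    def velocity_fn(x, t):
        return [(c - xi) / (1.0 - t) for c, xi in zip(target, x)]
    return velocity_fn


if __name__ == "__main__":
    random.seed(0)
    noise = [random.gauss(0.0, 1.0) for _ in range(2)]
    action = euler_sample(point_mass_velocity([0.5, -0.25]), noise, num_steps=4)
    print(action)  # close to [0.5, -0.25]
```

In a dual-flow setup of the kind described above, one such velocity model would generate actions conditioned on the state and a second would generate return samples conditioned on the state and action; `num_steps` plays the role of a flow-step count. A learned network would replace `point_mass_velocity`.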

If this is right

Value estimates remain unbiased even when returns are multimodal, removing a systematic source of error in actor-critic updates.
Policy distributions retain multiple distinct modes throughout training instead of collapsing, enabling continued exploration of high-return regions.
State-dependent regulation of exploration via entropy and covariance allows the agent to explore more in uncertain states and less in well-understood ones.
The same conditional flow matching machinery serves both the actor and the critic, so policy and value modeling share one generative technique rather than requiring two different model families.
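To make the idea of state-dependent exploration regulation concrete, here is a minimal sketch of a per-state noise scale driven by the spread of sampled actions. It is not the paper's ECER formula, which this summary does not state. The functional form, the names `action_variances`, `gaussian_entropy_proxy`, and `exploration_scale`, and the parameters `base_scale`, `gain`, `lo`, and `hi` are all hypothetical; the only thing taken from the text is the direction (more exploration where actions are more uncertain).

```python
import math


def action_variances(action_samples):
    """Per-dimension variance (diagonal of the covariance) of actions sampled at one state."""
    n = len(action_samples)
    dim = len(action_samples[0])
    means = [sum(a[d] for a in action_samples) / n for d in range(dim)]
    return [sum((a[d] - means[d]) ** 2 for a in action_samples) / n for d in range(dim)]


def gaussian_entropy_proxy(variances, eps=1e-8):
    """Entropy of a diagonal Gaussian with these variances; a crude stand-in for policy entropy."""
    return 0.5 * sum(math.log(2.0 * math.pi * math.e * (v + eps)) for v in variances)


def exploration_scale(action_samples, base_scale=1.0, gain=0.5, lo=0.1, hi=3.0):
    """Hypothetical per-state noise scale: grows with the entropy proxy, clipped to [lo, hi]."""
    entropy = gaussian_entropy_proxy(action_variances(action_samples))
    return max(lo, min(hi, base_scale * math.exp(gain * entropy)))


if __name__ == "__main__":
    well_understood = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1], [0.05, 0.0]]
    uncertain = [[1.0, -1.0], [-1.0, 1.0], [2.0, 0.5], [-2.0, -0.5]]
    print(exploration_scale(well_understood), exploration_scale(uncertain))
```

The clipping bounds keep the scale from collapsing to zero or growing without limit; a real regulator would also need to handle multimodal action samples, for which a single Gaussian entropy is only a rough proxy.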

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may extend to offline RL settings where multimodal return distributions are even more pronounced.
If the regulator proves robust, it could replace hand-tuned entropy bonuses in many current algorithms.
Training two coupled flow models increases compute; future work would need to measure whether the performance gain justifies the extra cost on larger state spaces.

Load-bearing premise

Conditional flow matching will keep producing diverse high-value action modes and unbiased return estimates across the tested control tasks without collapsing or drifting.

What would settle it

Run the same tasks with the dual-flow components replaced by standard Gaussian critics and unimodal policies. If performance stays at the level of the full method, the joint-modeling claim is falsified; a drop to levels comparable to prior diffusion or flow methods would instead support it.

Figures

Figures reproduced from arXiv: 2606.29820 by Diange Yang, Kun Jiang, Qijun Li, Qi Song, Weitao Zhou, Yifei He, Zheng Fu.

**Figure 2.** Ablations. (a–b) Evaluation curves on MuJoCo: We compare Dual-Flow with various diffusion and flow baselines on Humanoid-v3 and Ant-v3. Dual-Flow outperforms most baselines and achieves stronger overall performance. (c–d) Exploration-strength regulation: Dual-Flow on Humanoid-run with different exploration regulators, including ECER (ours), DACER-style entropy regulator, and fixed initial noise scale tunin…

**Figure 3.** Visualizing the return distribution. Columns from left to right show the predicted return distributions of Dual-Flow, C51, IQN, and DSAC, followed by the corresponding 1-Wasserstein distance. Dual-Flow better matches the ground-truth distributions and achieves the lowest error. Section V (Experiments), A. Experimental Setup, Baselines: Our method is compared and evaluated against 13 model-free continuous-control base…

**Figure 4.** Evaluation curves on benchmarks. Solid lines denote the mean and shaded regions indicate the 95% confidence interval. … IQN [18], and DSAC [9]. For ground truth, we fix the policy and estimate the distribution by performing repeated rollouts. For fair comparison, we use 5000 return samples and 60 bins for each histogram. These results demonstrate that our method provides a more accurate characterization of t…

**Figure 5.** Ablation studies. (a)–(b) Distributional critic vs. expected-value critic: Replacing the expected-value critic with our flow distributional critic leads to faster learning and higher final returns on Dog-trot (a) and Humanoid-run (b). (c)–(d) Sensitivity to flow steps on Dog-run: Learning curves (c) and IQM (d) across different M1 and M2 settings show that the method is robust to the choice of flow-step co…

**Figure 6.** Sensitivity analysis to the base regularization parameter λ0. (a,c) show learning curves on Humanoid-run and Dog-run for different λ0. (b,d) report the corresponding aggregated final performance (IQM return) with 95% stratified bootstrap confidence intervals. Larger λ0 (e.g., 1, 5) restricts exploration and degrades performance, while smaller values (e.g., 0.05, 0.1) behave similarly. Regularization coeffi…

**Figure 7.** Visualizations of the considered environments. From the DeepMind Control Suite (DMC) [13], we include ten locomotion tasks: Humanoid (Run/Stand), Dog (Run/Trot/Stand/Walk), Walker (Run/Walk/Stand), and Quadruped (Walk). From H-Bench [14], we include H1-sit hard and H1-balance simple. From MuJoCo Gym [15], we include Humanoid-v3 and Ant-v3. These environments are characterized by high-dimensional state-acti…
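The return-distribution comparison described in the Figure 3 and Figure 4 captions rests on two simple computations: a 1-Wasserstein distance between predicted and ground-truth return samples, and a fixed-bin histogram. The sketch below shows one standard way to compute both for one-dimensional samples. It assumes equal sample counts on both sides; the function names and the synthetic bimodal data are illustrative assumptions, not the paper's evaluation code.

```python
import random


def wasserstein_1(samples_a, samples_b):
    """1-Wasserstein distance between two equal-size 1-D empirical distributions.

    With equal sample counts and uniform weights, this is the mean absolute
    difference between the sorted samples.
    """
    if len(samples_a) != len(samples_b):
        raise ValueError("this sketch assumes equal sample counts")
    a, b = sorted(samples_a), sorted(samples_b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def histogram(samples, num_bins, low, high):
    """Normalized counts over [low, high] using equal-width bins."""
    counts = [0] * num_bins
    width = (high - low) / num_bins
    for s in samples:
        if low <= s <= high:
            idx = min(int((s - low) / width), num_bins - 1)
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


if __name__ == "__main__":
    random.seed(0)
    # Synthetic bimodal "ground truth" returns and a slightly shifted "prediction".
    truth = [random.gauss(-1.0, 0.3) if random.random() < 0.5 else random.gauss(2.0, 0.3)
             for _ in range(5000)]
    predicted = [r + 0.1 for r in truth]
    print(wasserstein_1(truth, predicted))
    print(len(histogram(truth, num_bins=60, low=-3.0, high=4.0)))
```

The sample count of 5000 and the 60 bins in the usage block mirror the numbers quoted in the Figure 4 caption; the bin range is an arbitrary choice for the synthetic data.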

Original abstract

In complex continuous-control reinforcement learning tasks, multimodal optimal actions often coincide with uncertain, multimodal return distributions, making reliable value estimation and multimodal exploration challenging. Existing value estimation methods using unimodal Gaussians restrict expressiveness and yield biased estimates. Recent generative policies can represent multimodal actions but often collapse to a few modes and under-explore high-value areas of the action space. Motivated by these challenges, we propose Dual-Flow RL, a unified actor-critic framework that jointly models a continuous return distribution and a multimodal policy distribution using conditional flow matching (CFM). This design supports reliable value estimation and sustained multimodal exploration. To further enhance exploration, we introduce an Entropy-Covariance Exploration Regulator (ECER) that enables state-aware exploration regulation leveraging policy entropy and action-uncertainty covariance. Experiments on DeepMind Control Suite and Humanoid-Bench show that Dual-Flow RL achieves state-of-the-art performance on most tasks, significantly outperforming prior diffusion-based and flow-based methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Dual-Flow RL pairs conditional flow matching on both return distributions and policies with an ECER regulator and reports SOTA numbers on standard suites, though the empirical details will decide how much weight to give the claims.

The main things to know are that this paper applies conditional flow matching to model both the critic's return distribution and the actor's policy distribution in one framework, then adds an Entropy-Covariance Exploration Regulator to adjust exploration in a state-aware way. That dual CFM setup plus the regulator is the concrete new element.

It does a clean job spelling out why unimodal value estimates and mode-collapsing generative policies are real limits in continuous control. The motivation lines up with known issues on tasks like those in DeepMind Control Suite. Treating the return distribution and policy jointly under the same flow-matching machinery is a reasonable unification that prior work on diffusion or flow policies did not emphasize.

The soft spots sit mostly in the experimental side. The abstract states clear outperformance over prior diffusion and flow methods, but without the tables, seed counts, ablation breakdowns, or statistical tests it is hard to judge whether the gains are stable or sensitive to post-hoc choices. The assumption that CFM will sustain multimodality without collapse or bias is stated as motivation rather than demonstrated in the given text, so that part still needs the data. No internal contradictions show up in the high-level description.

This is for people already working on generative actor-critic methods in RL who want to see flow matching extended to the value side. A reader tracking incremental improvements in exploration for continuous control would get value from the design.

It deserves a serious referee. The framework is coherent and the benchmarks are the right ones; referees can check the implementation details and strength of the results.

Referee Report

1 major / 2 minor

Summary. The paper proposes Dual-Flow RL, a unified actor-critic framework for continuous-control RL that jointly models a continuous return distribution and a multimodal policy distribution via conditional flow matching (CFM). It augments this with an Entropy-Covariance Exploration Regulator (ECER) that regulates exploration in a state-aware manner using policy entropy and action-uncertainty covariance. Experiments on DeepMind Control Suite and Humanoid-Bench are reported to show state-of-the-art performance on most tasks, outperforming prior diffusion-based and flow-based methods.

Significance. If the empirical results hold under rigorous evaluation, the work offers a coherent integration of CFM into an actor-critic loop that simultaneously targets reliable value estimation for multimodal returns and sustained multimodal exploration, addressing documented limitations of unimodal critics and mode-collapsing generative policies. The ECER component provides an explicit mechanism for state-dependent regulation that could be reusable beyond this architecture.

major comments (1)

[Experiments] The central empirical claim (SOTA on most tasks) is load-bearing for the contribution, yet the provided text supplies no visible details on experimental protocol, number of random seeds, statistical tests, baseline implementations, or ablation results that would allow verification that post-hoc hyperparameter choices or implementation details do not drive the reported gains.

minor comments (2)

[Method] Notation for the dual CFM objectives and the precise conditioning variables in the joint actor-critic update could be clarified with an explicit equation block early in the method section to aid reproducibility.
[Method] The abstract states that ECER 'enables state-aware exploration regulation,' but a short illustrative example or pseudocode showing how the covariance term modulates the entropy bonus per state would improve clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We appreciate the recognition of the potential significance of integrating conditional flow matching into an actor-critic framework with state-aware exploration. We address the major comment below.

Referee: [Experiments] The central empirical claim (SOTA on most tasks) is load-bearing for the contribution, yet the provided text supplies no visible details on experimental protocol, number of random seeds, statistical tests, baseline implementations, or ablation results that would allow verification that post-hoc hyperparameter choices or implementation details do not drive the reported gains.

Authors: We agree that the manuscript as provided does not include sufficient explicit details on the experimental protocol, which limits independent verification of the SOTA claims. In the revised version we will add a dedicated experimental protocol subsection (and expanded appendix) specifying: the number of random seeds (10 seeds per task), reporting conventions (mean and standard deviation across seeds), statistical comparisons (paired t-tests against baselines with p<0.05 thresholds), baseline implementation details (original code repositories, any re-implementations or hyperparameter selections, and training budgets), and full ablation results isolating the contribution of the dual-flow architecture versus ECER. These additions will be placed in the main body where space permits and otherwise in the appendix, directly addressing concerns about post-hoc choices.

Revision: yes
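The reporting conventions named in this response (mean and standard deviation across seeds, plus a paired t-test against a baseline) can be sketched as follows. This is a generic illustration under the assumption that both methods were run on the same seeds so that per-seed pairing is meaningful; the function names and the example numbers in the usage block are made up and are not results from the paper.

```python
import math
import statistics

# Two-sided 5% critical value of Student's t with 9 degrees of freedom,
# i.e. the threshold for 10 paired seeds.
T_CRIT_DF9 = 2.262


def summarize(returns):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(returns), statistics.stdev(returns)


def paired_t_statistic(method_returns, baseline_returns):
    """t statistic of the per-seed differences (both lists ordered by the same seeds)."""
    if len(method_returns) != len(baseline_returns):
        raise ValueError("paired test needs one baseline run per method run")
    diffs = [m - b for m, b in zip(method_returns, baseline_returns)]
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        raise ValueError("all per-seed differences are identical")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


if __name__ == "__main__":
    # Made-up final returns for 10 seeds; not taken from the paper.
    method = [812.0, 790.0, 805.0, 830.0, 798.0, 821.0, 809.0, 815.0, 801.0, 826.0]
    baseline = [780.0, 775.0, 799.0, 802.0, 770.0, 808.0, 781.0, 790.0, 785.0, 800.0]
    t_stat = paired_t_statistic(method, baseline)
    print(summarize(method), t_stat, abs(t_stat) > T_CRIT_DF9)
```

Comparing the statistic against a tabulated critical value avoids needing a t-distribution CDF from an external library; with a different seed count the critical value would have to change accordingly.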

Circularity Check

0 steps flagged

No significant circularity; method proposal lacks derivations or self-referential reductions

The provided abstract and description contain no equations, derivations, or explicit prediction steps. The paper proposes Dual-Flow RL as a joint CFM-based actor-critic plus ECER regulator, motivated by challenges in multimodal returns and exploration, then reports empirical results on standard benchmarks. No self-definitional constructions, fitted inputs renamed as predictions, or load-bearing self-citations appear. The approach builds on existing conditional flow matching without reducing any claimed result to its own inputs by construction. This is the expected non-finding for a high-level method proposal without mathematical derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are specified or derivable from the provided text.

Reference graph

Works this paper leans on

46 extracted references · 13 canonical work pages · 6 internal anchors

[1] Y. Wei, C. Zuo, and Y. Sui, “Scalable exploration for high-dimensional continuous control via value-guided flow,” arXiv preprint arXiv:2601.19707, 2026.
[2] N. Ma, J. Xuan, G. Zhang, and J. Lu, “Global–local decomposition of contextual representations in meta-reinforcement learning,” IEEE Transactions on Cybernetics, vol. 55, no. 3, pp. 1277–1287, 2025.
[3] D. Liu, Y. Wang, C. Liu, B. Luo, and B. Huang, “Ekg-ac: A new paradigm for process industrial optimization based on offline reinforcement learning with expert knowledge guidance,” IEEE Transactions on Cybernetics, pp. 1–11, 2025.
[4] P. Ning, L. Duan, and C. Hua, “Optimal tracking control of uncertain nonlinear systems using simplified reinforcement learning,” IEEE Transactions on Cybernetics, vol. 56, no. 6, pp. 3200–3209, 2026.
[5] B. Wijaya, M. Yang, K. Jiang et al., “Exploring the application of blockchain technology in crowdsource autonomous driving map updating,” Communications in Transportation Research, vol. 4, p. 100140, 2024.
[6] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International Conference on Machine Learning. PMLR, 2018, pp. 1861–1870.
[7] S. Fujimoto, H. Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in International Conference on Machine Learning. PMLR, 2018, pp. 1587–1596.
[8] A. Kumar, A. Zhou, G. Tucker, and S. Levine, “Conservative q-learning for offline reinforcement learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 1179–1191, 2020.
[9] J. Duan, Y. Guan, S. E. Li, Y. Ren, Q. Sun, and B. Cheng, “Distributional soft actor-critic: Off-policy reinforcement learning for addressing value estimation errors,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 11, pp. 6584–6598, 2021.
[10] W. Dabney, M. Rowland, M. Bellemare, and R. Munos, “Distributional reinforcement learning with quantile regression,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
[11] J. Brahmanage, J. Ling, and A. Kumar, “Flowpg: action-constrained policy gradient with normalizing flows,” Advances in Neural Information Processing Systems, vol. 36, pp. 20118–20132, 2023.
[12] Y. Wang, L. Wang, Y. Jiang, W. Zou, T. Liu, X. Song, W. Wang, L. Xiao, J. Wu, J. Duan et al., “Diffusion actor-critic with entropy regulator,” Advances in Neural Information Processing Systems, vol. 37, pp. 54183–54204, 2024.
[13] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq et al., “Deepmind control suite,” arXiv preprint arXiv:1801.00690, 2018.
[14] C. Sferrazza, D.-M. Huang, X. Lin, Y. Lee, and P. Abbeel, “Humanoid-bench: Simulated humanoid benchmark for whole-body locomotion and manipulation,” in Robotics: Science and Systems, 2024.
[15] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016.
[16] L. Lv, Y. Li, Y. Luo, F. Sun, T. Kong, J. Xu, and X. Ma, “Flow-based policy for online reinforcement learning,” arXiv preprint arXiv:2506.12811, 2025.
[17] M. G. Bellemare, W. Dabney, and R. Munos, “A distributional perspective on reinforcement learning,” in International Conference on Machine Learning. PMLR, 2017, pp. 449–458.
[18] W. Dabney, G. Ostrovski, D. Silver, and R. Munos, “Implicit quantile networks for distributional reinforcement learning,” in International Conference on Machine Learning. PMLR, 2018, pp. 1096–1105.
[19] G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, A. Muldal, N. Heess, and T. Lillicrap, “Distributed distributional deterministic policy gradients,” in International Conference on Learning Representations, 2018.
[20] Y. Ma, D. Jayaraman, and O. Bastani, “Conservative offline distributional reinforcement learning,” Advances in Neural Information Processing Systems, vol. 34, pp. 19235–19247, 2021.
[21] Y. Li, C.-H. Lai, C.-B. Schönlieb, Y. Mitsufuji, and S. Ermon, “Bellman diffusion: Generative modeling as learning a linear operator in the distribution space,” arXiv preprint arXiv:2410.01796, 2024.
[22] R. Kaddah, J. Read, M.-P. Cani et al., “Flow models for unbounded and geometry-aware distributional reinforcement learning,” arXiv preprint arXiv:2505.04310, 2025. Listed on its work page as “Parameter-Efficient Distributional RL via Normalizing Flows and a Geometry-Aware Cramér Surrogate.”
[23] S. Fujimoto, D. Meger, and D. Precup, “Off-policy deep reinforcement learning without exploration,” in International Conference on Machine Learning. PMLR, 2019, pp. 2052–2062.
[24] I. Kostrikov, A. Nair, and S. Levine, “Offline reinforcement learning with implicit q-learning,” in International Conference on Learning Representations, 2022.
[25] J. Ho and S. Ermon, “Generative adversarial imitation learning,” Advances in Neural Information Processing Systems, vol. 29, 2016.
[26] S. Ding, K. Hu, Z. Zhang, K. Ren, W. Zhang, J. Yu, J. Wang, and Y. Shi, “Diffusion-based reinforcement learning via q-weighted variational policy optimization,” Advances in Neural Information Processing Systems, vol. 37, pp. 53945–53968, 2024.
[27] S. Zhang, W. Zhang, and Q. Gu, “Energy-weighted flow matching for offline reinforcement learning,” arXiv preprint arXiv:2503.04975, 2025.
[28] Z. Wang, J. J. Hunt, and M. Zhou, “Diffusion policies as an expressive policy class for offline reinforcement learning,” in International Conference on Learning Representations, 2023.
[29] M. Psenka, A. Escontrela, P. Abbeel, and Y. Ma, “Learning a diffusion model policy from rewards via q-score matching,” in International Conference on Machine Learning, 2024.
[30] L. He, L. Zhang, J. Tan, and X. Wang, “Diffcps: Diffusion model based constrained policy search for offline reinforcement learning,” in International Conference on Learning Representations (ICLR), 2024.
[31] S. Park, Q. Li, and S. Levine, “Flow q-learning,” in International Conference on Machine Learning, 2025.
[32] T. Zhang, C. Yu, S. Su, and Y. Wang, “Reinflow: Fine-tuning flow matching policy with online reinforcement learning,” arXiv preprint arXiv:2505.22094, 2025.
[33] Y. Zhang, S. Yu, T. Zhang, M. Guang, H. Hui, K. Long, Y. Wang, C. Yu, and W. Ding, “Sac flow: Sample-efficient reinforcement learning of flow-based policies via velocity-reparameterized sequential modeling,” arXiv preprint arXiv:2509.25756, 2025.
[34] J. Liu, G. Liu, J. Liang, Y. Li, J. Liu, X. Wang, P. Wan, D. Zhang, and W. Ouyang, “Flow-grpo: Training flow matching models via online rl,” arXiv preprint arXiv:2505.05470, 2025.
[35] Z. Xue, J. Wu, Y. Gao, F. Kong, L. Zhu, M. Chen, Z. Liu, W. Liu, Q. Guo, W. Huang et al., “Dancegrpo: Unleashing grpo on visual generation,” arXiv preprint arXiv:2505.07818, 2025.
[36] J. Li, Y. Cui, T. Huang, Y. Ma, C. Fan, M. Yang, and Z. Zhong, “Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde,” arXiv preprint arXiv:2507.21802, 2025.
[37] A. Raffin and F. Stulp, “Smooth exploration for robotic reinforcement learning,” in Conference on Robot Learning. PMLR, 2021, pp. 1634–1644.
[38] A. S. Chiappa, A. Marin Vargas, A. Huang, and A. Mathis, “Latent exploration for reinforcement learning,” Advances in Neural Information Processing Systems, vol. 36, pp. 56508–56530, 2023.
[39] J. Hollenstein, G. Martius, and J. Piater, “Colored noise in ppo: improved exploration and performance through correlated action sampling,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 11, 2024, pp. 12466–12472.
[40] M. F. Huber, T. Bailey, H. Durrant-Whyte, and U. D. Hanebeck, “On entropy approximation for gaussian mixture random vectors,” in 2008 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems. IEEE, 2008, pp. 181–188.
[41] A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. Riedmiller, “Maximum a posteriori policy optimisation,” in International Conference on Learning Representations, 2018.
[42] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning. PMLR, 2016, pp. 1928–1937.
[43] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning. PMLR, 2015, pp. 1889–1897.
[44] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in International Conference on Learning Representations, 2016.
[45] M. Nauman, M. Ostaszewski, K. Jankowski, P. Miłoś, and M. Cygan, “Bigger, regularized, optimistic: scaling for compute and sample efficient continuous control,” Advances in Neural Information Processing Systems, vol. 37, pp. 113038–113071, 2024.
[46] V. Jain, T. Akhound-Sadegh, and S. Ravanbakhsh, “Sampling from energy-based policies using diffusion,” arXiv preprint arXiv:2410.01312, 2024.

[1] [1]

Scalable exploration for high- dimensional continuous control via value-guided flow,

Y . Wei, C. Zuo, and Y . Sui, “Scalable exploration for high- dimensional continuous control via value-guided flow,”arXiv preprint arXiv:2601.19707, 2026

work page arXiv 2026

[2] [2]

Global–local decomposition of contextual representations in meta-reinforcement learning,

N. Ma, J. Xuan, G. Zhang, and J. Lu, “Global–local decomposition of contextual representations in meta-reinforcement learning,”IEEE Transactions on Cybernetics, vol. 55, no. 3, pp. 1277–1287, 2025

2025

[3] [3]

Ekg-ac: A new paradigm for process industrial optimization based on offline reinforce- ment learning with expert knowledge guidance,

D. Liu, Y . Wang, C. Liu, B. Luo, and B. Huang, “Ekg-ac: A new paradigm for process industrial optimization based on offline reinforce- ment learning with expert knowledge guidance,”IEEE Transactions on Cybernetics, pp. 1–11, 2025

2025

[4] [4]

Optimal tracking control of uncertain nonlinear systems using simplified reinforcement learning,

P. Ning, L. Duan, and C. Hua, “Optimal tracking control of uncertain nonlinear systems using simplified reinforcement learning,”IEEE Trans- actions on Cybernetics, vol. 56, no. 6, pp. 3200–3209, 2026

2026

[5] [5]

Exploring the application of blockchain technology in crowdsource autonomous driving map updat- ing,

B. Wijaya, M. Yang, K. Jianget al., “Exploring the application of blockchain technology in crowdsource autonomous driving map updat- ing,”Communications in Transportation Research, vol. 4, p. 100140, 2024

2024

[6] [6]

Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor,

T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor,” inInternational conference on machine learning. Pmlr, 2018, pp. 1861–1870

2018

[7] [7]

Addressing function approxi- mation error in actor-critic methods,

S. Fujimoto, H. Hoof, and D. Meger, “Addressing function approxi- mation error in actor-critic methods,” inInternational conference on machine learning. PMLR, 2018, pp. 1587–1596

2018

[8] [8]

Conservative q-learning for offline reinforcement learning,

A. Kumar, A. Zhou, G. Tucker, and S. Levine, “Conservative q-learning for offline reinforcement learning,”Advances in neural information processing systems, vol. 33, pp. 1179–1191, 2020

2020

[9] [9]

Distributional soft actor-critic: Off-policy reinforcement learning for addressing value estimation errors,

J. Duan, Y . Guan, S. E. Li, Y . Ren, Q. Sun, and B. Cheng, “Distributional soft actor-critic: Off-policy reinforcement learning for addressing value estimation errors,”IEEE transactions on neural networks and learning systems, vol. 33, no. 11, pp. 6584–6598, 2021

2021

[10] [10]

Distributional reinforcement learning with quantile regression,

W. Dabney, M. Rowland, M. Bellemare, and R. Munos, “Distributional reinforcement learning with quantile regression,” inProceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018

2018

[11] [11]

Flowpg: action-constrained policy gradient with normalizing flows,

J. Brahmanage, J. Ling, and A. Kumar, “Flowpg: action-constrained policy gradient with normalizing flows,”Advances in Neural Information Processing Systems, vol. 36, pp. 20 118–20 132, 2023

2023

[12] [12]

Diffusion actor-critic with entropy regulator,

Y . Wang, L. Wang, Y . Jiang, W. Zou, T. Liu, X. Song, W. Wang, L. Xiao, J. Wu, J. Duanet al., “Diffusion actor-critic with entropy regulator,”Advances in Neural Information Processing Systems, vol. 37, pp. 54 183–54 204, 2024

2024

[13] [13]

DeepMind Control Suite

Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq et al., “Deepmind control suite,” arXiv preprint arXiv:1801.00690, 2018

2018

[14] [14]

Humanoidbench: Simulated humanoid benchmark for whole-body locomotion and manipulation,

C. Sferrazza, D.-M. Huang, X. Lin, Y. Lee, and P. Abbeel, “Humanoidbench: Simulated humanoid benchmark for whole-body locomotion and manipulation,” in Robotics: Science and Systems, 2024

2024

[15] [15]

OpenAI Gym

G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016

2016

[16] [16]

Flow-based policy for online reinforcement learning,

L. Lv, Y. Li, Y. Luo, F. Sun, T. Kong, J. Xu, and X. Ma, “Flow-based policy for online reinforcement learning,” arXiv preprint arXiv:2506.12811, 2025

2025

[17] [17]

A distributional perspective on reinforcement learning,

M. G. Bellemare, W. Dabney, and R. Munos, “A distributional perspective on reinforcement learning,” in International conference on machine learning. PMLR, 2017, pp. 449–458

2017

[18] [18]

Implicit quantile networks for distributional reinforcement learning,

W. Dabney, G. Ostrovski, D. Silver, and R. Munos, “Implicit quantile networks for distributional reinforcement learning,” in International conference on machine learning. PMLR, 2018, pp. 1096–1105

2018

[19] [19]

Distributed distributional deterministic policy gradients,

G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, A. Muldal, N. Heess, and T. Lillicrap, “Distributed distributional deterministic policy gradients,” in International Conference on Learning Representations, 2018

2018

[20] [20]

Conservative offline distributional reinforcement learning,

Y. Ma, D. Jayaraman, and O. Bastani, “Conservative offline distributional reinforcement learning,” Advances in neural information processing systems, vol. 34, pp. 19 235–19 247, 2021

2021

[21] [21]

Bellman diffusion: Generative modeling as learning a linear operator in the distribution space,

Y. Li, C.-H. Lai, C.-B. Schönlieb, Y. Mitsufuji, and S. Ermon, “Bellman diffusion: Generative modeling as learning a linear operator in the distribution space,” arXiv preprint arXiv:2410.01796, 2024

2024

[22] [22]

Parameter-Efficient Distributional RL via Normalizing Flows and a Geometry-Aware Cramér Surrogate

R. Kaddah, J. Read, M.-P. Cani et al., “Flow models for unbounded and geometry-aware distributional reinforcement learning,” arXiv preprint arXiv:2505.04310, 2025

2025

[23] [23]

Off-policy deep reinforcement learning without exploration,

S. Fujimoto, D. Meger, and D. Precup, “Off-policy deep reinforcement learning without exploration,” in International conference on machine learning. PMLR, 2019, pp. 2052–2062

2019

[24] [24]

Offline reinforcement learning with implicit q-learning,

I. Kostrikov, A. Nair, and S. Levine, “Offline reinforcement learning with implicit q-learning,” in International Conference on Learning Representations, 2022

2022

[25] [25]

Generative adversarial imitation learning,

J. Ho and S. Ermon, “Generative adversarial imitation learning,” Advances in neural information processing systems, vol. 29, 2016

2016

[26] [26]

Diffusion-based reinforcement learning via q-weighted variational policy optimization,

S. Ding, K. Hu, Z. Zhang, K. Ren, W. Zhang, J. Yu, J. Wang, and Y. Shi, “Diffusion-based reinforcement learning via q-weighted variational policy optimization,” Advances in Neural Information Processing Systems, vol. 37, pp. 53 945–53 968, 2024

2024

[27] [27]

Energy-weighted flow matching for offline reinforcement learning,

S. Zhang, W. Zhang, and Q. Gu, “Energy-weighted flow matching for offline reinforcement learning,” arXiv preprint arXiv:2503.04975, 2025

2025

[28] [28]

Diffusion policies as an expressive policy class for offline reinforcement learning,

Z. Wang, J. J. Hunt, and M. Zhou, “Diffusion policies as an expressive policy class for offline reinforcement learning,” in International Conference on Learning Representations, 2023

2023

[29] [29]

Learning a diffusion model policy from rewards via q-score matching,

M. Psenka, A. Escontrela, P. Abbeel, and Y. Ma, “Learning a diffusion model policy from rewards via q-score matching,” in International Conference on Machine Learning, 2024

2024

[30] [30]

Diffcps: Diffusion model based constrained policy search for offline reinforcement learning,

L. He, L. Zhang, J. Tan, and X. Wang, “Diffcps: Diffusion model based constrained policy search for offline reinforcement learning,” in International Conference on Learning Representations (ICLR), 2024

2024

[31] [31]

Flow q-learning,

S. Park, Q. Li, and S. Levine, “Flow q-learning,” in International Conference on Machine Learning, 2025

2025

[32] [32]

Reinflow: Fine-tuning flow matching policy with online reinforcement learning,

T. Zhang, C. Yu, S. Su, and Y. Wang, “Reinflow: Fine-tuning flow matching policy with online reinforcement learning,” arXiv preprint arXiv:2505.22094, 2025

2025

[33] [33]

Sac flow: Sample-efficient reinforcement learning of flow-based policies via velocity-reparameterized sequential modeling,

Y. Zhang, S. Yu, T. Zhang, M. Guang, H. Hui, K. Long, Y. Wang, C. Yu, and W. Ding, “Sac flow: Sample-efficient reinforcement learning of flow-based policies via velocity-reparameterized sequential modeling,” arXiv preprint arXiv:2509.25756, 2025

2025

[34] [34]

Flow-GRPO: Training Flow Matching Models via Online RL

J. Liu, G. Liu, J. Liang, Y. Li, J. Liu, X. Wang, P. Wan, D. Zhang, and W. Ouyang, “Flow-grpo: Training flow matching models via online rl,” arXiv preprint arXiv:2505.05470, 2025

2025

[35] [35]

DanceGRPO: Unleashing GRPO on Visual Generation

Z. Xue, J. Wu, Y. Gao, F. Kong, L. Zhu, M. Chen, Z. Liu, W. Liu, Q. Guo, W. Huang et al., “Dancegrpo: Unleashing grpo on visual generation,” arXiv preprint arXiv:2505.07818, 2025

2025

[36] [36]

MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

J. Li, Y. Cui, T. Huang, Y. Ma, C. Fan, M. Yang, and Z. Zhong, “Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde,” arXiv preprint arXiv:2507.21802, 2025

2025

[37] [37]

Smooth exploration for robotic reinforcement learning,

A. Raffin and F. Stulp, “Smooth exploration for robotic reinforcement learning,” in Conference on Robot Learning. PMLR, 2021, pp. 1634–1644

2021

[38] [38]

Latent exploration for reinforcement learning,

A. S. Chiappa, A. Marin Vargas, A. Huang, and A. Mathis, “Latent exploration for reinforcement learning,” Advances in Neural Information Processing Systems, vol. 36, pp. 56 508–56 530, 2023

2023

[39] [39]

Colored noise in ppo: improved exploration and performance through correlated action sampling,

J. Hollenstein, G. Martius, and J. Piater, “Colored noise in ppo: improved exploration and performance through correlated action sampling,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 11, 2024, pp. 12 466–12 472

2024

[40] [40]

On entropy approximation for gaussian mixture random vectors,

M. F. Huber, T. Bailey, H. Durrant-Whyte, and U. D. Hanebeck, “On entropy approximation for gaussian mixture random vectors,” in 2008 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems. IEEE, 2008, pp. 181–188

2008

[41] [41]

Maximum a posteriori policy optimisation,

A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. Riedmiller, “Maximum a posteriori policy optimisation,” in International Conference on Learning Representations, 2018

2018

[42] [42]

Asynchronous methods for deep reinforcement learning,

V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning. PMLR, 2016, pp. 1928–1937

2016

[43] [43]

Trust region policy optimization,

J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International conference on machine learning. PMLR, 2015, pp. 1889–1897

2015

[44] [44]

Continuous control with deep reinforcement learning,

T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in International Conference on Learning Representations, 2016

2016

[45] [45]

Bigger, regularized, optimistic: scaling for compute and sample efficient continuous control,

M. Nauman, M. Ostaszewski, K. Jankowski, P. Miłoś, and M. Cygan, “Bigger, regularized, optimistic: scaling for compute and sample efficient continuous control,” Advances in neural information processing systems, vol. 37, pp. 113 038–113 071, 2024

2024

[46] [46]

Sampling from energy-based policies using diffusion,

V. Jain, T. Akhound-Sadegh, and S. Ravanbakhsh, “Sampling from energy-based policies using diffusion,” arXiv preprint arXiv:2410.01312, 2024

2024