A note on convergence of Wasserstein policy optimization

David Šiška; Yufei Zhang

arXiv: 2605.22622 · v1 · pith:3EVZ5R2Unew · submitted 2026-05-21 · cs.LG · math.OC



Pith reviewed 2026-05-22 06:33 UTC · model grok-4.3

Classification: cs.LG · math.OC

Keywords: Wasserstein Policy Optimization · linear convergence · entropy-regularized MDPs · gradient flows · log-Sobolev inequality · reinforcement learning · continuous action spaces


The pith

Wasserstein Policy Optimization converges linearly to the global optimum under entropy regularization in continuous MDPs, provided a sufficiently regular solution to its gradient flow exists.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that Wasserstein Policy Optimization, which optimizes policies via Wasserstein gradient flows, achieves linear convergence when embedded in entropy-regularized Markov Decision Processes. It establishes this by showing that the flow dissipates energy monotonically and satisfies a local log-Sobolev inequality, once a sufficiently regular solution to the gradient flow equation is assumed to exist. These two properties together imply that the value function approaches the global optimum at a linear rate. The result supplies a conditional theoretical guarantee for an algorithm already observed to work well on continuous-state and continuous-action tasks.

Core claim

Within the framework of entropy-regularised Markov Decision Processes, Wasserstein Policy Optimization converges linearly. The argument leverages recent advances in the mean-field analysis of gradient-flow convergence via log-Sobolev inequalities. Assuming the existence of a sufficiently regular solution to the gradient flow equation, the authors demonstrate monotonic energy dissipation along the flow and establish a local log-Sobolev inequality. These two properties then allow them to argue that the value function should converge linearly to the global optimum.

What carries the argument

The gradient flow of the entropy-regularized objective in the Wasserstein space of probability measures over actions (the stochastic policy at each state), analyzed via monotonic energy dissipation and a local log-Sobolev inequality.
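The dissipation-plus-log-Sobolev mechanism can be written schematically as follows. This is a generic single-state sketch of the standard argument, not the paper's statement. The energy F (the negative entropy-regularised objective, with a hypothetical reward r and temperature τ), the dissipation term I and the constant λ are our own notation. In the MDP setting the flow also runs over states with a weighting that is not reproduced here. The regularity assumption is what licenses the integration by parts behind the dissipation identity and the use of the inequality along the flow.

```latex
\begin{aligned}
F(\pi) &= -\int_A r(a)\,\pi(da) + \tau \int_A \pi(a)\log \pi(a)\,da,
\qquad
\partial_t \pi_t = \nabla_a \cdot \Big( \pi_t \, \nabla_a \frac{\delta F}{\delta \pi}(\pi_t, \cdot) \Big), \\
\frac{d}{dt} F(\pi_t) &= -\int_A \Big| \nabla_a \frac{\delta F}{\delta \pi}(\pi_t, a) \Big|^2 \pi_t(da)
=: -\mathcal{I}(\pi_t) \le 0
&& \text{(energy dissipation)}, \\
F(\pi_t) - F(\pi^*) &\le \frac{1}{2\lambda}\, \mathcal{I}(\pi_t)
&& \text{(local log-Sobolev inequality, } \lambda > 0), \\
\frac{d}{dt}\big( F(\pi_t) - F(\pi^*) \big) &\le -2\lambda \big( F(\pi_t) - F(\pi^*) \big)
\;\Longrightarrow\;
F(\pi_t) - F(\pi^*) \le e^{-2\lambda t} \big( F(\pi_0) - F(\pi^*) \big).
\end{aligned}
```

The regularised value is -F, so the last line bounds the value gap by e^{-2λt} times its initial value. In continuous time, this is what "linear convergence" means here.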

If this is right

The value function converges linearly to the global optimum.
Energy decreases monotonically along the Wasserstein gradient flow.
A local log-Sobolev inequality holds for the regularized objective under the regularity assumption.
Linear convergence extends to the full policy optimization problem in continuous state-action spaces.
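A toy case makes the rate visible. The sketch below is our own illustration, not the paper's setting or the WPO update. It assumes a single state, a scalar action, a hypothetical quadratic reward r(a) = -(a - m)^2 / 2 and a temperature tau. With this reward, the entropy-regularised Wasserstein gradient flow is the Fokker-Planck equation of an Ornstein-Uhlenbeck process. Gaussian policies therefore stay Gaussian, and the optimality gap has a closed form. The Gaussian log-Sobolev inequality for the optimum N(m, tau) predicts gap(t) <= exp(-2t) * gap(0), and the script checks this bound numerically.

```python
import math


def regularised_objective(mu, var, m, tau):
    """J(pi) = E_pi[r(a)] + tau * H(pi) for pi = N(mu, var).

    Toy assumption: single state, scalar action, reward r(a) = -(a - m)**2 / 2.
    """
    expected_reward = -((mu - m) ** 2 + var) / 2.0
    entropy = 0.5 * math.log(2.0 * math.pi * math.e * var)  # differential entropy of N(mu, var)
    return expected_reward + tau * entropy


def flow_moments(t, mu0, var0, m, tau):
    """Mean and variance at time t of the Wasserstein gradient flow of J.

    For this reward the flow is the Fokker-Planck equation of the
    Ornstein-Uhlenbeck SDE da = -(a - m) dt + sqrt(2 * tau) dW,
    so a Gaussian initial policy stays Gaussian.
    """
    mu = m + (mu0 - m) * math.exp(-t)
    var = tau + (var0 - tau) * math.exp(-2.0 * t)
    return mu, var


def optimality_gap(t, mu0, var0, m, tau):
    """J(pi*) - J(pi_t), where the optimum pi* = N(m, tau) is proportional to exp(r / tau)."""
    best = regularised_objective(m, tau, m, tau)
    mu, var = flow_moments(t, mu0, var0, m, tau)
    return best - regularised_objective(mu, var, m, tau)


if __name__ == "__main__":
    m, tau, mu0, var0 = 1.0, 0.5, -2.0, 3.0  # arbitrary illustrative values
    gap0 = optimality_gap(0.0, mu0, var0, m, tau)
    for t in [0.0, 0.5, 1.0, 2.0, 4.0]:
        gap = optimality_gap(t, mu0, var0, m, tau)
        bound = math.exp(-2.0 * t) * gap0  # Gaussian log-Sobolev prediction
        print(f"t={t:3.1f}  gap={gap:.3e}  bound={bound:.3e}")
```

The mean error decays like exp(-t), so the squared-mean part of the gap decays at exactly the predicted rate exp(-2t). The bound is therefore tight whenever the initial mean is off target.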

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar dissipation-plus-log-Sobolev arguments might apply to other gradient-flow formulations of policy search.
The linear rate could be used to set step-size schedules or early-stopping criteria in practical implementations.
Removing the regularity assumption would require new tools from analysis of singular gradient flows.
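As a rough illustration of the step-size and early-stopping point, a bound of the form gap(t) <= exp(-2 * rate * t) * gap(0) can be inverted to give a time horizon for a target tolerance. This is a minimal sketch under the following assumptions:

- the rate constant and the initial gap are known inputs, both hypothetical here;
- discretisation and estimation errors are ignored.

```python
import math


def stopping_time(initial_gap, tol, rate):
    """Smallest t with exp(-2 * rate * t) * initial_gap <= tol.

    Assumes the continuous-time bound gap(t) <= exp(-2 * rate * t) * gap(0),
    with a known log-Sobolev-type constant `rate` (hypothetical input).
    """
    if initial_gap <= 0 or tol <= 0 or rate <= 0:
        raise ValueError("initial_gap, tol and rate must be positive")
    if initial_gap <= tol:
        return 0.0
    return math.log(initial_gap / tol) / (2.0 * rate)


def step_budget(initial_gap, tol, rate, step_size):
    """Number of steps of size `step_size` needed to cover stopping_time.

    Ignores time-discretisation and estimation error, which a real
    implementation would have to account for separately.
    """
    return math.ceil(stopping_time(initial_gap, tol, rate) / step_size)


if __name__ == "__main__":
    # Illustrative numbers only: gap 10 -> 1e-3 with rate 0.5 and step 0.01.
    print(stopping_time(10.0, 1e-3, 0.5), step_budget(10.0, 1e-3, 0.5, 0.01))
```

Under a linear rate, halving the tolerance costs only an extra log(2) / (2 * rate) units of time. This is why such bounds translate into logarithmic iteration budgets.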

Load-bearing premise

A sufficiently regular solution to the gradient flow equation exists.

What would settle it

A concrete continuous MDP in which the Wasserstein gradient flow solution loses regularity or the observed convergence rate of the value function is sub-linear.
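A sub-linear rate can be tested concretely. On a log scale a linearly convergent gap is a straight line, so least-squares slopes of log(gap) over sliding windows stay roughly constant and negative. An O(1/t) gap instead gives slopes that drift towards zero. The sketch below uses synthetic gap sequences, not data from the paper, and the window length and sampling grid are arbitrary choices. In practice, computing gaps also requires an estimate of the optimal value, which is itself an assumption.

```python
import math


def log_gap_slope(times, gaps):
    """Least-squares slope of log(gap) against time."""
    if len(times) != len(gaps) or len(times) < 2:
        raise ValueError("need at least two (time, gap) pairs")
    if any(g <= 0 for g in gaps):
        raise ValueError("gaps must be positive")
    ys = [math.log(g) for g in gaps]
    t_bar = sum(times) / len(times)
    y_bar = sum(ys) / len(ys)
    num = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, ys))
    den = sum((t - t_bar) ** 2 for t in times)
    return num / den


def windowed_slopes(times, gaps, window):
    """Slopes over consecutive windows; roughly constant and negative for a linear rate."""
    return [
        log_gap_slope(times[i:i + window], gaps[i:i + window])
        for i in range(len(times) - window + 1)
    ]


if __name__ == "__main__":
    times = [0.5 * k for k in range(20)]
    geometric = [math.exp(-0.8 * t) for t in times]  # synthetic linear-rate gaps
    sublinear = [1.0 / (1.0 + t) for t in times]     # synthetic O(1/t) gaps
    print(windowed_slopes(times, geometric, 5))
    print(windowed_slopes(times, sublinear, 5))
```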

Original abstract

Wasserstein Policy Optimization (WPO) is a recently proposed reinforcement learning algorithm that leverages Wasserstein gradient flows to optimize stochastic policies in continuous action spaces. Despite its empirical success, the theoretical convergence properties of WPO in environments with continuous state and action spaces have yet to be fully established. In this note, we argue that WPO within the framework of entropy-regularised Markov Decision Processes converges linearly. This is done by leveraging recent advances in mean-field analysis for convergence of gradient flows using log-Sobolev inequalities. Assuming the existence of a sufficiently regular solution to the gradient flow equation, we demonstrate monotonic energy dissipation along the flow and establish a local log-Sobolev inequality. Ultimately, these properties allow us to argue that the value function should converge linearly to the global optimum.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The note derives a conditional linear convergence rate for WPO by applying mean-field log-Sobolev tools, but the argument stops at an unverified regularity assumption on the gradient flow.


The main takeaway is that this short note applies recent mean-field gradient-flow techniques to show linear convergence of the value function for Wasserstein policy optimization in entropy-regularized MDPs. The authors establish monotonic energy dissipation and a local log-Sobolev inequality along the flow, then conclude linear convergence to the global optimum. That connection to existing log-Sobolev results is the concrete step forward: it takes an empirically used algorithm and gives it a theoretical rate, conditional on a regularity assumption. The writing is direct and the logic follows once the pieces are granted.

The obvious limitation is the standing assumption that a sufficiently regular solution to the gradient flow PDE exists. In continuous state-action spaces the Wasserstein flow can lose smoothness or concentrate, and the note supplies no independent argument or check that this regularity actually holds for the WPO objective. Without that step the linear rate remains conditional. No explicit constants or quantitative error bounds appear in the abstract; this is typical for a note but leaves the practical strength of the result open. The citations track the relevant mean-field literature without obvious gaps or circularity.

This is the sort of paper that belongs in a reading group for people working on continuous-action RL theory or Wasserstein methods; a specialist can extract the technique and decide whether the regularity gap can be closed separately. It is worth sending to peer review because the core argument is coherent and the gap is stated plainly rather than hidden; referees can push on whether the assumption is justifiable or needs to be removed.

Referee Report

1 major / 2 minor

Summary. The manuscript claims that Wasserstein Policy Optimization (WPO) in entropy-regularized MDPs over continuous state-action spaces converges linearly to the global optimum. This is established by showing monotonic energy dissipation along the Wasserstein gradient flow and deriving a local log-Sobolev inequality, conditional on the existence of a sufficiently regular solution to the gradient flow PDE, and by invoking recent mean-field analysis techniques.

Significance. If the regularity assumption can be justified, the note would provide useful theoretical grounding for the linear convergence of an empirically successful method, correctly identifying the role of energy dissipation and log-Sobolev inequalities in mean-field RL analysis. The approach aligns with standard techniques in the field and highlights a clear path from gradient-flow properties to value-function convergence.

major comments (1)

[Abstract and main argument] Abstract and central argument: the linear convergence of the value function is derived only after assuming existence of a sufficiently regular solution to the gradient flow PDE. This assumption is required both for monotonic energy dissipation and for the local log-Sobolev inequality. In continuous-state entropy-regularized MDPs the objective is typically non-convex, and Wasserstein flows on such objectives can lose smoothness or develop concentrations; no independent verification, sufficient conditions, or reference establishing the required regularity (e.g., bounded density or Sobolev control) is supplied. Because the linear rate does not follow without this step, the assumption is load-bearing for the main claim.

minor comments (2)

[Abstract] The abstract refers to 'recent advances in mean-field analysis' without citing the specific works; adding explicit references would improve traceability.
[Notation and setup] Notation for the evolving policy measure and the associated energy functional should be introduced once and used consistently to avoid ambiguity in the flow equations.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their careful review and for acknowledging the alignment of our approach with standard mean-field techniques. We address the major comment regarding the regularity assumption below.


Referee: [Abstract and main argument] Abstract and central argument: the linear convergence of the value function is derived only after assuming existence of a sufficiently regular solution to the gradient flow PDE. This assumption is required both for monotonic energy dissipation and for the local log-Sobolev inequality. In continuous-state entropy-regularized MDPs the objective is typically non-convex, and Wasserstein flows on such objectives can lose smoothness or develop concentrations; no independent verification, sufficient conditions, or reference establishing the required regularity (e.g., bounded density or Sobolev control) is supplied. Because the linear rate does not follow without this step, the assumption is load-bearing for the main claim.

Authors: We agree that the existence of a sufficiently regular solution to the gradient flow PDE is a load-bearing assumption, as it underpins both the monotonic energy dissipation and the local log-Sobolev inequality used to obtain the linear convergence rate. The manuscript is explicitly framed as a note deriving the convergence result conditionally on this regularity, rather than establishing the regularity itself. This conditional structure is standard in mean-field gradient flow analyses, particularly for non-convex objectives where global regularity can be difficult to verify without additional assumptions on the MDP or policy class. We will revise the manuscript to expand the discussion of this limitation, clarify its role in the argument, and include references to related works that employ analogous regularity assumptions in Wasserstein gradient flows and mean-field RL (e.g., papers invoking local log-Sobolev inequalities under density bounds or Sobolev regularity). We do not provide new sufficient conditions for regularity here, as that would constitute a separate technical contribution beyond the scope of this note. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is conditional on explicit assumption and external advances


The paper states its central argument under an explicit assumption of existence of a sufficiently regular solution to the gradient flow equation, then uses this to show monotonic energy dissipation and a local log-Sobolev inequality before concluding linear convergence of the value function. It leverages recent external advances in mean-field analysis rather than deriving the key inequalities from its own fitted quantities or prior self-citations. No step in the provided derivation chain reduces a claimed result to an input by construction, renames a known pattern, or imports uniqueness via overlapping-author citations that bear the full load. The argument is therefore self-contained as a conditional analysis against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The argument depends on an existence assumption for a sufficiently regular solution to the gradient flow equation and on the validity of a local log-Sobolev inequality in this setting; no free parameters or new entities are introduced in the abstract.

axioms (1)

Domain assumption: existence of a sufficiently regular solution to the gradient flow equation.
Explicitly stated in the abstract as the starting point for demonstrating monotonic energy dissipation and the local log-Sobolev inequality.

pith-pipeline@v0.9.0 · 5656 in / 1262 out tokens · 33941 ms · 2026-05-22T06:33:42.004175+00:00


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Paper passage: "Assuming existence of sufficiently regular solution to the gradient flow equation we demonstrate monotonic energy dissipation along the flow and establish a local log-Sobolev inequality."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
Paper passage: "We prove that energy dissipates monotonically along the flow... value function converges exponentially fast to the global optimal value."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

143 extracted references · 143 canonical work pages · 1 internal anchor


[1] [1]

Linear convergence of proximal descent schemes on the

Lascu, Razvan-Andrei and Majka, Mateusz B and. Linear convergence of proximal descent schemes on the. arXiv preprint arXiv:2411.15067 , year=

work page arXiv

[2] [2]

arXiv preprint arXiv:2505.00663v1 , year=

Wasserstein Policy Optimization , author=. arXiv preprint arXiv:2505.00663v1 , year=

work page arXiv

[3] [3]

Kerimkulov, Bekzhan and Leahy, James-Michael and Siska, David and Szpruch, Lukasz and Zhang, Yufei , journal=. A. 2025 , publisher=

work page 2025

[4] [4]

1986 , publisher=

Logarithmic Sobolev inequalities and stochastic Ising models , author=. 1986 , publisher=

work page 1986

[5] [5]

Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN'94) , volume=

Reinforcement learning in continuous time: Advantage updating , author=. Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN'94) , volume=. 1994 , organization=

work page 1994

[6] [6]

Making deep

Tallec, Corentin and Blier, L. Making deep. International Conference on Machine Learning , pages=. 2019 , organization=

work page 2019

[7] [7]

arXiv preprint arXiv:2202.01009 , year=

Mean-field langevin dynamics: Exponential convergence and annealing , author=. arXiv preprint arXiv:2202.01009 , year=

work page arXiv

[8] [8]

Convex analysis of the mean field

Nitanda, Atsushi and Wu, Denny and Suzuki, Taiji , booktitle=. Convex analysis of the mean field. 2022 , organization=

work page 2022

[9] [9]

arXiv preprint arXiv:2105.08368 , year=

Convergence rates of gradient methods for convex optimization in the space of measures , author=. arXiv preprint arXiv:2105.08368 , year=

work page arXiv

[10] [10]

CS Dept., UW Seattle, Seattle, WA, USA, Tech

Reinforcement learning: Theory and algorithms , author=. CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep , volume=

work page

[11] [11]

Systems & Control Letters , volume=

Remarks on input to state stability of perturbed gradient flows, motivated by model-free feedback control learning , author=. Systems & Control Letters , volume=. 2022 , publisher=

work page 2022

[12] [12]

arXiv preprint arXiv:2211.00617 , year=

Convergence of policy gradient methods for finite-horizon stochastic linear-quadratic control problems , author=. arXiv preprint arXiv:2211.00617 , year=

work page arXiv

[13] [13]

2016 , publisher=

Information geometry and its applications , author=. 2016 , publisher=

work page 2016

[14] [14]

Optimal transport for applied mathematicians , author=. Birk. 2015 , publisher=

work page 2015

[15] [15]

Gallou. A. SIAM Journal on Mathematical Analysis , volume=. 2017 , publisher=

work page 2017

[16] [16]

Neural computation , volume=

Natural gradient works efficiently in learning , author=. Neural computation , volume=. 1998 , publisher=

work page 1998

[17] [17]

Gradient flows for regularized stochastic control problems , author =

work page

[18] [18]

On linear and super-linear convergence of Natural Policy Gradient algorithm , journal =

Sajad Khodadadian and Prakirt Raj Jhunjhunwala and Sushil Mahavir Varma and Siva Theja Maguluri , keywords =. On linear and super-linear convergence of Natural Policy Gradient algorithm , journal =. 2022 , issn =. doi:https://doi.org/10.1016/j.sysconle.2022.105214 , url =

work page doi:10.1016/j.sysconle.2022.105214 2022

[19] [19]

arXiv preprint arXiv:2308.07591 , year=

Q-Learning for Continuous State and Action MDPs under Average Cost Criteria , author=. arXiv preprint arXiv:2308.07591 , year=

work page arXiv

[20] [20]

Optimality and approximation with policy gradient methods in

Agarwal, Alekh and Kakade, Sham M and Lee, Jason D and Mahajan, Gaurav , year = 2020, booktitle =. Optimality and approximation with policy gradient methods in

work page 2020

[21] [21]

On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift , author =. J. Mach. Learn. Res. , volume = 22, number = 98, pages =

work page

[22] [22]

Global optimality of softmax policy gradient with single hidden layer neural networks in the mean-field regime , author =

work page

[23] [23]

Global Optimality Of Softmax Policy Gradient With Single Hidden Layer Neural Networks In The Mean-field Regime , author =

work page

[24] [24]

Linear convergence for natural policy gradient with log-linear policy parametrization , author =

work page

[25] [25]

Infinite Dimensional Analysis:

Aliprantis, Charalambos D and Border, Kim C , year = 2006, publisher =. Infinite Dimensional Analysis:

work page 2006

[26] [26]

Methods of information geometry , author =

work page

[27] [27]

Gradient flows: in metric spaces and in the space of probability measures , author =

work page

[28] Pierre-Cyril Aubin-Frankowski, Anna Korba, and L. Mirror descent with relative smoothness in measure spaces, with application to. Advances in Neural Information Processing Systems, 35.

[29] Reinforcement Learning with General Utilities: Simpler Variance Reduction and Large State-Action Space.

[30] First-order methods in optimization.

[31] On the hidden biases of policy mirror ascent in continuous action spaces. International Conference on Machine Learning.

[32] On the sample complexity and metastability of heavy-tailed policy search in continuous control.

[33] Stochastic optimal control: the discrete-time case.

[34] Global optimality guarantees for policy gradient methods.

[35] Policy gradient using weak derivatives for reinforcement learning. 2019 IEEE 58th Conference on Decision and Control (CDC), 2019.

[36] Vladimir I. Bogachev, Andrei I. Kirillov, and Stanislav V. Shaposhnikov (2018). Distances between Stationary Distributions of Diffusions and Solvability of Nonlinear

[37] Vladimir I. Bogachev, Nicolai V. Krylov, and R

[38] Vladimir I. Bogachev and R. Convergence in variation of solutions of nonlinear. Journal of Functional Analysis.

[39] Vladimir I. Bogachev and R. Distances between transition probabilities of diffusions and applications to nonlinear. Journal of Functional Analysis.

[40] Perturbation analysis of optimization problems.

[41] Functional analysis, Sobolev spaces and partial differential equations.

[42] Oleg A. Butkovsky (2014). On ergodic properties of nonlinear

[43] Probabilistic Theory of Mean Field Games with Applications I-II.

[44] Adaptive Control and Intersections with Reinforcement Learning. Annual Review of Control, Robotics, and Autonomous Systems, 2023.

[45] Linear convergence of entropy-regularized natural policy gradient with linear function approximation.

[46] Fast global convergence of natural policy gradient methods with entropy regularization. Operations Research.

[47] On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in Neural Information Processing Systems.

[48] Linear and nonlinear functional analysis with applications.

[49] Reinforcement learning in continuous time and space. Neural Computation, 2000.

[50] Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies. Annual Review of Control, Robotics, and Autonomous Systems, 2023.

[51] François Delarue. Uniform in time weak propagation of chaos on the

[52] Methods of nonlinear analysis: applications to differential equations.

[53] A weak convergence approach to the theory of large deviations.

[54] Ilyas Fatkhullin, Anas Barakat, Anastasia Kireeva, and Niao He (2023). Stochastic policy gradient methods: Improved sample complexity for

[55] Global convergence of policy gradient methods for the linear quadratic regulator. International Conference on Machine Learning.

[56] Real analysis: modern techniques and their applications.

[57] Taming the noise in reinforcement learning via soft updates.

[58] A certain class of diffusion processes associated with nonlinear parabolic equations. Zeitschrift für

[59] On principled entropy exploration in policy optimization. Proceedings of the 28th International Joint Conference on Artificial Intelligence.

[60] Reinforcement learning with deep energy-based policies. International Conference on Machine Learning.

[61] Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. International Conference on Machine Learning.

[62] Actor-critic algorithms. Advances in Neural Information Processing Systems.

[63] On the lattice structure of kernel operators. Mathematische Nachrichten, 2015.

[64] Natural actor-critic algorithms. Automatica, 2009.

[65] A topological property of real analytic subsets. Coll. du CNRS, Les

[66] Gradient methods for minimizing functionals. Zhurnal vychislitel’noi matematiki i matematicheskoi fiziki.

[67] Krzysztof Kurdyka (1998). Annales de l'Institut Fourier.

[68] William R. P. Hammersley and. Mc. Annales de l'Institut Henri Poincaré.

[69] Policy Gradient Converges to the Globally Optimal Policy for Nearly Linear-Quadratic Regulators.

[70] Discrete-time Markov control processes: basic optimality criteria.

[71] Kaitong Hu, Anna Kazeykina, and Zhenjie Ren (2019). Mean-field

[72] Kaitong Hu, Zhenjie Ren, and. Mean-field. Annales de l'Institut Henri Poincaré (B) Probabilités et Statistiques, 57(4).

[73] Kaitong Hu, Zhenjie Ren, and. Mean-field. Annales de l'Institut Henri Poincaré.

[74] Distribution dependent stochastic differential equations. Frontiers of Mathematics in China.

[75] Jean-François Jabir. Mean-field neural

[76] Optimal Convergence Rate for Exact Policy Mirror Descent in Discounted Markov Decision Processes.

[77] Richard Jordan, David Kinderlehrer, and Felix Otto (1998). The variational formulation of the

[78] Approximately optimal approximate reinforcement learning. Proceedings of the Nineteenth International Conference on Machine Learning.

[79] A natural policy gradient.

[80] On the linear convergence of natural policy gradient algorithm. 2021 60th IEEE Conference on Decision and Control (CDC), 2021.