Wasserstein Formulation of Reinforcement Learning. An Optimal Transport Perspective on Policy Optimization
Pith reviewed 2026-05-10 12:08 UTC · model grok-4.3
The pith
Reinforcement learning policies are mapped into Wasserstein space so that policy optimization becomes a gradient flow with explicit second-order structure.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By viewing policies as maps into the Wasserstein space of action probabilities and inducing a Riemannian structure from stationary distributions, a general RL optimization problem admits a gradient flow whose direction and curvature are given explicitly by Otto calculus; the resulting gradient and Hessian provide the first- and second-order information needed to optimize policies in both low- and high-dimensional settings.
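For orientation, the textbook Otto-calculus identities on the Wasserstein space of probability measures over R^d can be written as below. These are the standard formulas from the optimal-transport literature, not the paper's policy-space expressions, which additionally involve the stationary distribution over states. The energy E, the potential V, and the tangent vector \nabla\varphi are notation assumed here for illustration.

```latex
% Formal Wasserstein gradient and gradient flow of an energy E on P_2(R^d)
\operatorname{grad}_{W} E(\mu)
  = -\nabla \cdot \Big( \mu \, \nabla \frac{\delta E}{\delta \mu} \Big),
\qquad
\partial_t \mu_t
  = \nabla \cdot \Big( \mu_t \, \nabla \frac{\delta E}{\delta \mu}(\mu_t) \Big).

% Special case: potential energy E(\mu) = \int V \, d\mu, tangent vector \nabla\varphi
\operatorname{Hess}_{W} E(\mu)[\nabla\varphi, \nabla\varphi]
  = \int \big\langle \nabla^2 V(x)\, \nabla\varphi(x), \nabla\varphi(x) \big\rangle \, d\mu(x).
```

In the potential-energy case the Hessian is positive whenever V is convex, which is the kind of curvature information a second-order analysis can exploit.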
What carries the argument
The Riemannian structure induced by stationary distributions on the space of policies: weighting the tangent spaces of action probability measures by the stationary distribution over states gives an inner product, and hence a geometry that supports geodesics and gradient flows.
If this is right
- Any RL objective that can be written as an energy on policy space now possesses a well-defined gradient flow.
- The Hessian supplies curvature information that can be used for accelerated or Newton-style updates.
- In low-dimensional problems the gradient can be computed directly from the theoretical formalism.
- High-dimensional problems remain tractable by parameterizing the policy with a neural network and using an ergodic average of the cost.
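The ergodic average mentioned in the last point can be illustrated on a toy example. The sketch below is not the paper's procedure: it assumes a hypothetical two-state Markov chain (standing in for the chain induced by a fixed policy) with made-up transition probabilities and costs, and compares the time-average of the cost along one trajectory with the exact expectation under the stationary distribution.

```python
import random


def ergodic_average_cost(p_leave, cost, steps, seed=0):
    """Time-average of the cost along one trajectory of a 2-state chain.

    p_leave[s] is the probability of leaving state s at each step.
    All numbers used below are illustrative assumptions.
    """
    rng = random.Random(seed)
    state, total = 0, 0.0
    for _ in range(steps):
        total += cost[state]
        if rng.random() < p_leave[state]:
            state = 1 - state  # jump to the other state
    return total / steps


def stationary_cost(p_leave, cost):
    """Exact expectation of the cost under the stationary distribution."""
    p, q = p_leave
    rho = (q / (p + q), p / (p + q))  # closed form for a 2-state chain
    return rho[0] * cost[0] + rho[1] * cost[1]


p_leave = (0.3, 0.1)  # hypothetical leave probabilities
cost = (1.0, 4.0)     # hypothetical per-state costs
exact = stationary_cost(p_leave, cost)
estimate = ergodic_average_cost(p_leave, cost, steps=200_000)
print(f"exact stationary cost: {exact:.4f}, ergodic estimate: {estimate:.4f}")
```

The gap between the two numbers is the approximation error of the ergodic estimate; it shrinks as the trajectory gets longer, at a rate that depends on how fast the chain mixes.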
Where Pith is reading between the lines
- The same geometric construction may yield continuous-time limits of common discrete RL algorithms.
- Convergence rates could be compared against standard policy-gradient methods on shared benchmark tasks.
- Relaxing the stationary-distribution assumption might extend the framework to non-stationary or episodic settings.
- Links to other optimal-transport applications in control could produce hybrid algorithms that move both states and actions in Wasserstein space.
Load-bearing premise
A Riemannian metric on the policy space can be defined from stationary distributions for arbitrary environments, and the vector fields that map states to tangent vectors remain measurable.
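One way this premise can fail is easy to see in a finite setting: if the stationary distribution assigns zero mass to a state, a stationary-weighted inner product cannot detect tangent vectors supported there. The sketch below assumes a hypothetical three-state chain and a simple weighted norm of the form sum over states of rho(s) times the squared norm of the tangent vector at s. Both are illustrative assumptions, not the paper's definitions.

```python
def stationary_distribution(P, iters=10_000, tol=1e-12):
    """Power iteration rho <- rho P for a row-stochastic matrix P."""
    n = len(P)
    rho = [1.0 / n] * n
    for _ in range(iters):
        new = [sum(rho[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, rho)) < tol:
            return new
        rho = new
    return rho


def weighted_sq_norm(rho, field_sq_norms):
    """Squared norm of a state-indexed field: sum_s rho(s) * ||v_s||^2."""
    return sum(r * v for r, v in zip(rho, field_sq_norms))


# Hypothetical chain in which state 2 is transient: once left, never re-entered.
P = [
    [0.9, 0.1, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 1.0, 0.0],
]
rho = stationary_distribution(P)

# A field that is nonzero only at the transient state has zero weighted norm,
# so the induced inner product is degenerate for this chain.
degenerate_norm = weighted_sq_norm(rho, [0.0, 0.0, 1.0])
print("stationary distribution:", rho)
print("norm of a field supported on state 2:", degenerate_norm)
```

Positivity of the stationary measure rules out this particular degeneracy in the finite case; continuous or unbounded state spaces would need correspondingly stronger conditions.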
What would settle it
In a low-dimensional environment where the optimal policy is known analytically, compute the gradient flow trajectory directly from the derived formulas and check whether it converges to that known optimum.
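A much-simplified version of such a check can be sketched for a single-state problem with a one-dimensional action. The code below does not use the paper's formulas: it assumes a hypothetical quadratic cost c(a) = (a - A_STAR)^2 / 2, for which the optimal policy is the Dirac mass at A_STAR, and integrates the Wasserstein gradient flow of the potential energy E(mu) = integral of c dmu with a standard particle scheme in which each particle follows da/dt = -c'(a).

```python
import random

A_STAR = 1.5  # hypothetical known optimal action


def grad_cost(a):
    """Derivative of the assumed cost c(a) = (a - A_STAR)^2 / 2."""
    return a - A_STAR


def particle_gradient_flow(particles, grad, step=0.05, n_steps=400):
    """Explicit-Euler particle scheme for the flow of E(mu) = int c dmu."""
    for _ in range(n_steps):
        particles = [a - step * grad(a) for a in particles]
    return particles


rng = random.Random(0)
# Samples from an arbitrary initial policy (Gaussian, chosen for illustration).
initial = [rng.gauss(-2.0, 1.0) for _ in range(200)]
final = particle_gradient_flow(initial, grad_cost)

# The W2 distance from an empirical measure to a Dirac mass is the
# root-mean-square deviation of the particles from that point.
w2_to_optimum = (sum((a - A_STAR) ** 2 for a in final) / len(final)) ** 0.5
print(f"W2 distance to the known optimum after the flow: {w2_to_optimum:.2e}")
```

A test of the paper's claims would replace the particle update with the gradient derived from its formalism and use an environment with genuine state dynamics, so that the stationary distribution actually enters the flow.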
Original abstract
We present a geometric framework for Reinforcement Learning (RL) that views policies as maps into the Wasserstein space of action probabilities. First, we define a Riemannian structure induced by stationary distributions, proving its existence in a general context. We then define the tangent space of policies and characterize the geodesics, specifically addressing the measurability of vector fields mapped from the state space to the tangent space of probability measures over the action space. Next, we formulate a general RL optimization problem and construct a gradient flow using Otto's calculus. We compute the gradient and the Hessian of the energy, providing a formal second-order analysis. Finally, we illustrate the method with numerical examples for low-dimensional problems, computing the gradient directly from our theoretical formalism. For high-dimensional problems, we parameterize the policy using a neural network and optimize it based on an ergodic approximation of the cost.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a geometric framework for reinforcement learning by viewing policies as maps from states to the Wasserstein space of action probability measures. It defines a Riemannian structure on the policy space induced by stationary distributions (claiming existence in general MDPs), characterizes the tangent space and geodesics while addressing measurability of state-to-tangent vector fields, formulates RL as energy minimization, constructs the gradient flow via Otto calculus, derives the gradient and Hessian of the energy for second-order analysis, and illustrates the method with direct gradient computation on low-dimensional problems and neural-network parameterization with ergodic approximation on high-dimensional ones.
Significance. If the core constructions are rigorous, the work offers a novel optimal-transport perspective on policy optimization that could unify geometric methods with RL, with the formal second-order (Hessian) analysis providing a clear strength for analyzing convergence. The use of Otto calculus for the gradient flow is a technically interesting contribution, though the numerical examples remain illustrative rather than comparative.
major comments (2)
- [§2] §2 (Riemannian structure induced by stationary distributions): The existence claim for the Riemannian metric in a general context is load-bearing for the entire gradient-flow construction, yet the manuscript supplies no explicit regularity conditions (e.g., uniform ergodicity, positivity of the stationary measure, or Lipschitz continuity of the transition kernel) that guarantee the induced inner product is well-defined and non-degenerate for arbitrary policies in continuous or unbounded spaces; without these, the Otto-calculus gradient and Hessian are undefined on the full policy space.
- [§3] §3 (tangent space and measurability of vector fields): The characterization of measurable vector fields mapping states to the tangent space of action-probability measures is essential for the gradient flow to be rigorously defined, but the provided arguments do not state sufficient conditions ensuring measurability is preserved under the stationary-distribution weighting; this directly affects whether the derived gradient and Hessian are valid operators.
minor comments (2)
- [§5] §5 (numerical examples): The low-dimensional illustrations compute the gradient from the formalism but report no quantitative metrics (convergence rates, regret, or baseline comparisons), weakening the empirical support for the theoretical claims.
- [Abstract and §4] Abstract and §4: The high-dimensional case relies on an 'ergodic approximation' of the cost without specifying the approximation error or its effect on the Hessian analysis; a brief error bound would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review of our manuscript. We address the two major comments point by point below, acknowledging where additional rigor is required and outlining the planned revisions.
- Referee: [§2] §2 (Riemannian structure induced by stationary distributions): The existence claim for the Riemannian metric in a general context is load-bearing for the entire gradient-flow construction, yet the manuscript supplies no explicit regularity conditions (e.g., uniform ergodicity, positivity of the stationary measure, or Lipschitz continuity of the transition kernel) that guarantee the induced inner product is well-defined and non-degenerate for arbitrary policies in continuous or unbounded spaces; without these, the Otto-calculus gradient and Hessian are undefined on the full policy space.
  Authors: We agree that the existence of the Riemannian metric induced by stationary distributions requires explicit regularity conditions to be rigorously established for general MDPs, especially in continuous or unbounded spaces. Although the manuscript asserts a proof of existence in a general context, it does not enumerate the necessary assumptions (such as uniform ergodicity, positivity of the stationary measure, or Lipschitz continuity of the transition kernel) that ensure the inner product is well-defined and non-degenerate. In the revised version we will add a dedicated subsection in §2 that states these minimal conditions and shows how they guarantee the metric properties, thereby justifying the subsequent application of Otto calculus and the gradient/Hessian derivations.
  Revision: yes
- Referee: [§3] §3 (tangent space and measurability of vector fields): The characterization of measurable vector fields mapping states to the tangent space of action-probability measures is essential for the gradient flow to be rigorously defined, but the provided arguments do not state sufficient conditions ensuring measurability is preserved under the stationary-distribution weighting; this directly affects whether the derived gradient and Hessian are valid operators.
  Authors: We concur that the arguments concerning measurability of state-to-tangent vector fields need to be strengthened by explicitly stating sufficient conditions under which measurability is preserved when the fields are weighted by the stationary distribution. The manuscript addresses measurability in §3, yet the conditions (for instance, joint measurability with respect to the product sigma-algebra and positivity of the stationary measure) are not stated with sufficient clarity. We will revise the relevant paragraphs in §3 to include these conditions, ensuring that the tangent-space characterization, gradient flow, and the derived gradient and Hessian operators are rigorously valid.
  Revision: yes
Circularity Check
No circularity: derivation builds on external Otto calculus and independent existence proof
The paper first proves existence of the Riemannian structure induced by stationary distributions in a general context, then applies Otto's calculus to define the gradient flow on the policy space. Gradient and Hessian computations follow directly from the energy functional and tangent-space characterization without any reduction to fitted parameters, self-definitions, or self-citation chains. The measurability of vector fields is addressed as part of the tangent-space construction rather than presupposed as an output. No step equates a derived quantity to its own input by construction, and the framework remains self-contained against the cited external optimal-transport machinery.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: existence of a Riemannian structure on the space of policies induced by stationary distributions.
- Domain assumption: measurability of vector fields from the state space to the tangent space of action probability measures.
Reference graph
Works this paper leans on
- [1] Charalambos D Aliprantis and Kim C Border. Infinite dimensional analysis: a hitchhiker’s guide. Springer, 3rd edition, 2006.
- [2] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
- [3] Kavosh Asadi, Dipendra Misra, and Michael L Littman. Lipschitz continuity in model-based reinforcement learning. In International Conference on Machine Learning (ICML), pages 264–273. PMLR, 2018.
- [4] Jean-David Benamou and Yann Brenier. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375–393, 2000.
- [5] Giuseppe Da Prato and Jerzy Zabczyk. Ergodicity for Infinite Dimensional Systems, volume 229 of London Mathematical Society Lecture Note Series. Cambridge University Press, Cambridge, 1996.
- [6] Rick Durrett. Probability: Theory and Examples. Cambridge University Press, 5th edition, 2019.
- [7] Martin Hairer. Ergodic properties of Markov processes. Lecture notes, University of Warwick, 2006.
- [8] Onésimo Hernández-Lerma and Jean Bernard Lasserre. Markov chains and invariant probabilities, volume 211. Birkhäuser, 2003.
- [9] Sham M Kakade. A natural policy gradient. In Advances in neural information processing systems (NeurIPS), pages 1531–1538, 2001.
- [10] Leonid B. Koralov and Yakov G. Sinai. Theory of Probability and Random Processes. Springer Berlin Heidelberg, 2nd edition, 2007.
- [11] Sean P. Meyn and Richard L. Tweedie. Markov Chains and Stochastic Stability. Cambridge University Press, 2nd edition, 2009.
- [12] Felix Otto. The geometry of dissipative evolution equations: the porous medium equation. Communications in Partial Differential Equations, 26(1-2):101–174, 2001.
- [13] Aldo Pacchiano, Jack Parker-Holder, Yunhao Tang, Krzysztof Choromanski, Anna Choromanska, and Michael I Jordan. Learning to score behaviors for guided policy optimization. arXiv preprint arXiv:2006.00000, 2020.
- [14] David Pfau, Ian Davies, Diana Borsa, Joao G. M. Araujo, Brendan Tracey, and Hado van Hasselt. Wasserstein policy optimization. 2025.
- [15] Matteo Pirotta, Marcello Restelli, and Luca Bascetta. Policy gradient in Lipschitz Markov decision processes. Machine Learning, 100(2-3):255–283, 2015.
- [16] Filippo Santambrogio. Optimal transport for applied mathematicians, volume 55. Springer, 2015.
- [17] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning (ICML), pages 1889–1897. PMLR, 2015.
- [18] Cédric Villani. Optimal Transport: Old and New, volume 338 of Grundlehren der mathematischen Wissenschaften. Springer, Berlin, Heidelberg, 2009.
- [19] Ruiqi Zhang, Chen Chen, Chunyuan Li, and Lawrence Carin. Policy optimization as Wasserstein gradient flows. In International Conference on Machine Learning (ICML), pages 12400–12410. PMLR, 2021.
- [20] Hanna Ziesche and Leonel Rozo. Wasserstein gradient flows for optimizing Gaussian mixture policies. In Advances in Neural Information Processing Systems (NeurIPS), 2023.