Wasserstein Formulation of Reinforcement Learning. An Optimal Transport Perspective on Policy Optimization

Mathias Dus (IRMA)

arxiv: 2604.14765 · v1 · submitted 2026-04-16 · 💻 cs.LG · math.OC· math.PR

Wasserstein Formulation of Reinforcement Learning. An Optimal Transport Perspective on Policy Optimization

Mathias Dus (IRMA) This is my paper

Pith reviewed 2026-05-10 12:08 UTC · model grok-4.3

classification 💻 cs.LG math.OCmath.PR

keywords reinforcement learningWasserstein spaceoptimal transportpolicy optimizationgradient flowRiemannian structureOtto calculusstationary distributions

0 comments

The pith

Reinforcement learning policies are mapped into Wasserstein space so that policy optimization becomes a gradient flow with explicit second-order structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines policies as functions from states to probability measures over actions and equips the resulting space with a Riemannian metric drawn from the environment's stationary distributions. This metric makes it possible to apply Otto calculus and construct a gradient flow for a general reinforcement learning objective. The gradient and Hessian of the associated energy functional are derived directly, supplying a formal second-order analysis that standard first-order methods lack. Numerical examples confirm that the flow can be computed exactly in low dimensions and approximated via neural networks in higher dimensions.

Core claim

By viewing policies as maps into the Wasserstein space of action probabilities and inducing a Riemannian structure from stationary distributions, a general RL optimization problem admits a gradient flow whose direction and curvature are given explicitly by Otto calculus; the resulting gradient and Hessian provide the first- and second-order information needed to optimize policies in both low- and high-dimensional settings.

What carries the argument

The Riemannian structure induced by stationary distributions on the space of policies, which turns the tangent space of action probability measures into a metric space supporting geodesics and gradient flows.

If this is right

Any RL objective that can be written as an energy on policy space now possesses a well-defined gradient flow.
The Hessian supplies curvature information that can be used for accelerated or Newton-style updates.
Low-dimensional problems allow exact gradient computation without sampling approximations.
High-dimensional problems remain tractable by parameterizing the policy with a neural network and using an ergodic average of the cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same geometric construction may yield continuous-time limits of common discrete RL algorithms.
Convergence rates could be compared against standard policy-gradient methods on shared benchmark tasks.
Relaxing the stationary-distribution assumption might extend the framework to non-stationary or episodic settings.
Links to other optimal-transport applications in control could produce hybrid algorithms that move both states and actions in Wasserstein space.

Load-bearing premise

A Riemannian metric on the policy space can be defined from stationary distributions for arbitrary environments, and the vector fields that map states to tangent vectors remain measurable.

What would settle it

In a low-dimensional environment where the optimal policy is known analytically, compute the gradient flow trajectory directly from the derived formulas and check whether it converges to that known optimum.

Figures

Figures reproduced from arXiv: 2604.14765 by Mathias Dus (IRMA).

**Figure 3.** Figure 3: Simulated trajectory. Top: State evolution with and without noise. Bottom: Control actions [PITH_FULL_IMAGE:figures/full_fig_p031_3.png] view at source ↗

**Figure 4.** Figure 4: Grid-based results. Left: Value Function heatmap. Right: Policy mean action heatmap. [PITH_FULL_IMAGE:figures/full_fig_p032_4.png] view at source ↗

**Figure 5.** Figure 5: Convergence of Average Cost (Grid) [PITH_FULL_IMAGE:figures/full_fig_p032_5.png] view at source ↗

**Figure 7.** Figure 7: Direct Diff. Physics: Trajectory and Control [PITH_FULL_IMAGE:figures/full_fig_p032_7.png] view at source ↗

**Figure 10.** Figure 10: Direct Diff. Physics: Trajectory and Control [PITH_FULL_IMAGE:figures/full_fig_p033_10.png] view at source ↗

**Figure 14.** Figure 14: Joint Training: Policy Loss (left) vs [PITH_FULL_IMAGE:figures/full_fig_p033_14.png] view at source ↗

**Figure 15.** Figure 15: Direct Differentiable Physics: State trajectories (top) and Control inputs (bottom). [PITH_FULL_IMAGE:figures/full_fig_p033_15.png] view at source ↗

**Figure 16.** Figure 16: Training Loss for Direct Differentiable Physics. [PITH_FULL_IMAGE:figures/full_fig_p034_16.png] view at source ↗

**Figure 17.** Figure 17: World Model Approach: State trajectories (left) and Control inputs (right). [PITH_FULL_IMAGE:figures/full_fig_p034_17.png] view at source ↗

**Figure 18.** Figure 18: Joint Training Losses: Policy Loss (left) and World Model Prediction Loss (right). [PITH_FULL_IMAGE:figures/full_fig_p034_18.png] view at source ↗

read the original abstract

We present a geometric framework for Reinforcement Learning (RL) that views policies as maps into the Wasserstein space of action probabilities. First, we define a Riemannian structure induced by stationary distributions, proving its existence in a general context. We then define the tangent space of policies and characterize the geodesics, specifically addressing the measurability of vector fields mapped from the state space to the tangent space of probability measures over the action space. Next, we formulate a general RL optimization problem and construct a gradient flow using Otto's calculus. We compute the gradient and the Hessian of the energy, providing a formal second-order analysis. Finally, we illustrate the method with numerical examples for low-dimensional problems, computing the gradient directly from our theoretical formalism. For high-dimensional problems, we parameterize the policy using a neural network and optimize it based on an ergodic approximation of the cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper recasts RL policy optimization as a gradient flow on a Wasserstein Riemannian manifold of policies using Otto calculus, but the existence and measurability claims for general MDPs look under-supported.

read the letter

The main contribution is a geometric reformulation that treats policies as maps into the Wasserstein space of action distributions, equips policy space with a Riemannian metric induced by stationary distributions, and then runs an Otto-calculus gradient flow on the RL objective. They also characterize geodesics, claim to compute the gradient and Hessian explicitly, and give low-dimensional numerical checks plus a neural-net version with ergodic cost approximation for bigger problems. That combination of stationary-distribution metric, explicit second-order analysis, and transport tools is not standard in the usual policy-gradient literature, so the framing itself is distinct. The low-dimensional examples follow the formalism directly and the high-dimensional workaround is pragmatic. The paper does engage the literature on Otto calculus and stationary measures without obvious circularity. The soft spots sit in the foundations. The Riemannian structure and the measurability of the state-to-tangent vector fields are asserted to exist in a general context, yet the stress-test concern is real: without extra regularity such as uniform ergodicity or Lipschitz transitions, stationary measures can be zero or non-unique and the inner product can degenerate, especially in continuous or unbounded spaces. If those conditions are not stated sharply in the proofs, the gradient flow and Hessian are not defined on the full policy space the authors want to work with. The abstract supplies no proof sketches or quantitative results, so the support for the central claims remains thin until the full text is checked. This is for RL theorists who already work with optimal transport or geometric views of probability spaces. A reader interested in new analysis tools for policy optimization could extract ideas, but only after verifying the existence arguments. It is worth sending to a serious referee to test the measurability and non-degeneracy steps and to see whether the numerics improve on standard methods.

Referee Report

2 major / 2 minor

Summary. The paper develops a geometric framework for reinforcement learning by viewing policies as maps from states to the Wasserstein space of action probability measures. It defines a Riemannian structure on the policy space induced by stationary distributions (claiming existence in general MDPs), characterizes the tangent space and geodesics while addressing measurability of state-to-tangent vector fields, formulates RL as energy minimization, constructs the gradient flow via Otto calculus, derives the gradient and Hessian of the energy for second-order analysis, and illustrates the method with direct gradient computation on low-dimensional problems and neural-network parameterization with ergodic approximation on high-dimensional ones.

Significance. If the core constructions are rigorous, the work offers a novel optimal-transport perspective on policy optimization that could unify geometric methods with RL, with the formal second-order (Hessian) analysis providing a clear strength for analyzing convergence. The use of Otto calculus for the gradient flow is a technically interesting contribution, though the numerical examples remain illustrative rather than comparative.

major comments (2)

[§2] §2 (Riemannian structure induced by stationary distributions): The existence claim for the Riemannian metric in a general context is load-bearing for the entire gradient-flow construction, yet the manuscript supplies no explicit regularity conditions (e.g., uniform ergodicity, positivity of the stationary measure, or Lipschitz continuity of the transition kernel) that guarantee the induced inner product is well-defined and non-degenerate for arbitrary policies in continuous or unbounded spaces; without these, the Otto-calculus gradient and Hessian are undefined on the full policy space.
[§3] §3 (tangent space and measurability of vector fields): The characterization of measurable vector fields mapping states to the tangent space of action-probability measures is essential for the gradient flow to be rigorously defined, but the provided arguments do not state sufficient conditions ensuring measurability is preserved under the stationary-distribution weighting; this directly affects whether the derived gradient and Hessian are valid operators.

minor comments (2)

[§5] §5 (numerical examples): The low-dimensional illustrations compute the gradient from the formalism but report no quantitative metrics (convergence rates, regret, or baseline comparisons), weakening the empirical support for the theoretical claims.
[Abstract and §4] Abstract and §4: The high-dimensional case relies on an 'ergodic approximation' of the cost without specifying the approximation error or its effect on the Hessian analysis; a brief error bound would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review of our manuscript. We address the two major comments point by point below, acknowledging where additional rigor is required and outlining the planned revisions.

read point-by-point responses

Referee: [§2] §2 (Riemannian structure induced by stationary distributions): The existence claim for the Riemannian metric in a general context is load-bearing for the entire gradient-flow construction, yet the manuscript supplies no explicit regularity conditions (e.g., uniform ergodicity, positivity of the stationary measure, or Lipschitz continuity of the transition kernel) that guarantee the induced inner product is well-defined and non-degenerate for arbitrary policies in continuous or unbounded spaces; without these, the Otto-calculus gradient and Hessian are undefined on the full policy space.

Authors: We agree that the existence of the Riemannian metric induced by stationary distributions requires explicit regularity conditions to be rigorously established for general MDPs, especially in continuous or unbounded spaces. Although the manuscript asserts a proof of existence in a general context, it does not enumerate the necessary assumptions (such as uniform ergodicity, positivity of the stationary measure, or Lipschitz continuity of the transition kernel) that ensure the inner product is well-defined and non-degenerate. In the revised version we will add a dedicated subsection in §2 that states these minimal conditions and shows how they guarantee the metric properties, thereby justifying the subsequent application of Otto calculus and the gradient/Hessian derivations. revision: yes
Referee: [§3] §3 (tangent space and measurability of vector fields): The characterization of measurable vector fields mapping states to the tangent space of action-probability measures is essential for the gradient flow to be rigorously defined, but the provided arguments do not state sufficient conditions ensuring measurability is preserved under the stationary-distribution weighting; this directly affects whether the derived gradient and Hessian are valid operators.

Authors: We concur that the arguments concerning measurability of state-to-tangent vector fields need to be strengthened by explicitly stating sufficient conditions under which measurability is preserved when the fields are weighted by the stationary distribution. The manuscript addresses measurability in §3, yet the conditions (for instance, joint measurability with respect to the product sigma-algebra and positivity of the stationary measure) are not stated with sufficient clarity. We will revise the relevant paragraphs in §3 to include these conditions, ensuring that the tangent-space characterization, gradient flow, and the derived gradient and Hessian operators are rigorously valid. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation builds on external Otto calculus and independent existence proof

full rationale

The paper first proves existence of the Riemannian structure induced by stationary distributions in a general context, then applies Otto's calculus to define the gradient flow on the policy space. Gradient and Hessian computations follow directly from the energy functional and tangent-space characterization without any reduction to fitted parameters, self-definitions, or self-citation chains. The measurability of vector fields is addressed as part of the tangent-space construction rather than presupposed as an output. No step equates a derived quantity to its own input by construction, and the framework remains self-contained against the cited external optimal-transport machinery.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central constructions rest on the existence of a Riemannian structure induced by stationary distributions and on measurability of the lifted vector fields; no free parameters or new entities are introduced in the abstract.

axioms (2)

domain assumption Existence of a Riemannian structure on the space of policies induced by stationary distributions
Stated as proved in a general context before defining tangent spaces and geodesics.
domain assumption Measurability of vector fields from state space to tangent space of action probability measures
Explicitly addressed as a technical requirement for the geometric constructions.

pith-pipeline@v0.9.0 · 5443 in / 1419 out tokens · 45738 ms · 2026-05-10T12:08:53.672611+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

[1]

Springer, 3rd edition, 2006

Charalambos D Aliprantis and Kim C Border.Infinite dimensional analysis: a hitchhiker’s guide. Springer, 3rd edition, 2006

work page 2006
[2]

Springer Science & Business Media, 2008

Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré.Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008

work page 2008
[3]

Lipschitz continuity in model-based re- inforcement learning

Kavosh Asadi, Dipendra Misra, and Michael L Littman. Lipschitz continuity in model-based re- inforcement learning. In International Conference on Machine Learning (ICML), pages 264–273. PMLR, 2018

work page 2018
[4]

A computational fluid mechanics solution to the monge- kantorovich mass transfer problem.Numerische Mathematik, 84(3):375–393, 2000

Jean-David Benamou and Yann Brenier. A computational fluid mechanics solution to the monge- kantorovich mass transfer problem.Numerische Mathematik, 84(3):375–393, 2000

work page 2000
[5]

Cambridge University Press, Cambridge, 1996

Giuseppe Da Prato and Jerzy Zabczyk.Ergodicity for Infinite Dimensional Systems, volume 229 of London Mathematical Society Lecture Note Series. Cambridge University Press, Cambridge, 1996

work page 1996
[6]

Cambridge University Press, 5th edition, 2019

Rick Durrett.Probability: Theory and Examples. Cambridge University Press, 5th edition, 2019

work page 2019
[7]

Ergodic properties of markov processes

Martin Hairer. Ergodic properties of markov processes. Lecture notes, University of Warwick, 2006

work page 2006
[8]

Birkhäuser, 2003

Onésimo Hernández-Lerma and Jean Bernard Lasserre.Markov chains and invariant probabilities, volume 211. Birkhäuser, 2003

work page 2003
[9]

A natural policy gradient

Sham M Kakade. A natural policy gradient. InAdvances in neural information processing systems (NeurIPS), pages 1531–1538, 2001

work page 2001
[10]

Koralov and Yakov G

Leonid B. Koralov and Yakov G. Sinai.Theory of Probability and Random Processes. Springer Berlin Heidelberg, 2nd edition, 2007

work page 2007
[11]

Meyn and Richard L

Sean P. Meyn and Richard L. Tweedie.Markov Chains and Stochastic Stability. Cambridge Uni- versity Press, 2nd edition, 2009

work page 2009
[12]

The geometry of dissipative evolution equations: the porous medium equation.Com- munications in Partial Differential Equations, 26(1-2):101–174, 2001

Felix Otto. The geometry of dissipative evolution equations: the porous medium equation.Com- munications in Partial Differential Equations, 26(1-2):101–174, 2001

work page 2001
[13]

Learning to score behaviors for guided policy optimization.arXiv preprint arXiv:2006.00000, 2020

Aldo Pacchiano, Jack Parker-Holder, Yunhao Tang, Krzysztof Choromanski, Anna Choromanska, and Michael I Jordan. Learning to score behaviors for guided policy optimization.arXiv preprint arXiv:2006.00000, 2020

work page arXiv 2006
[14]

David Pfau, Ian Davies, Diana Borsa, Joao G. M. Araujo, Brendan Tracey, and Hado van Hasselt. Wasserstein policy optimization. 2025

work page 2025
[15]

Policy gradient in lipschitz markov decision processes

Matteo Pirotta, Marcello Restelli, and Luca Bascetta. Policy gradient in lipschitz markov decision processes. Machine Learning, 100(2-3):255–283, 2015

work page 2015
[16]

Springer, 2015

Filippo Santambrogio.Optimal transport for applied mathematicians, volume 55. Springer, 2015

work page 2015
[17]

Trust region policy optimization

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning (ICML), pages 1889–1897. PMLR, 2015

work page 2015
[18]

Optimal Transport: Old and New, volume 338 ofGrundlehren der mathematischen Wissenschaften

Cédric Villani. Optimal Transport: Old and New, volume 338 ofGrundlehren der mathematischen Wissenschaften. Springer, Berlin, Heidelberg, 2009

work page 2009
[19]

Policy optimization as wasserstein gradient flows

Ruiqi Zhang, Chen Chen, Chunyuan Li, and Lawrence Carin. Policy optimization as wasserstein gradient flows. In International Conference on Machine Learning (ICML), pages 12400–12410. PMLR, 2021

work page 2021
[20]

Wasserstein gradient flows for optimizing gaussian mixture policies

Hanna Ziesche and Leonel Rozo. Wasserstein gradient flows for optimizing gaussian mixture policies. In Advances in Neural Information Processing Systems (NeurIPS), 2023. 35

work page 2023

[1] [1]

Springer, 3rd edition, 2006

Charalambos D Aliprantis and Kim C Border.Infinite dimensional analysis: a hitchhiker’s guide. Springer, 3rd edition, 2006

work page 2006

[2] [2]

Springer Science & Business Media, 2008

Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré.Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008

work page 2008

[3] [3]

Lipschitz continuity in model-based re- inforcement learning

Kavosh Asadi, Dipendra Misra, and Michael L Littman. Lipschitz continuity in model-based re- inforcement learning. In International Conference on Machine Learning (ICML), pages 264–273. PMLR, 2018

work page 2018

[4] [4]

A computational fluid mechanics solution to the monge- kantorovich mass transfer problem.Numerische Mathematik, 84(3):375–393, 2000

Jean-David Benamou and Yann Brenier. A computational fluid mechanics solution to the monge- kantorovich mass transfer problem.Numerische Mathematik, 84(3):375–393, 2000

work page 2000

[5] [5]

Cambridge University Press, Cambridge, 1996

Giuseppe Da Prato and Jerzy Zabczyk.Ergodicity for Infinite Dimensional Systems, volume 229 of London Mathematical Society Lecture Note Series. Cambridge University Press, Cambridge, 1996

work page 1996

[6] [6]

Cambridge University Press, 5th edition, 2019

Rick Durrett.Probability: Theory and Examples. Cambridge University Press, 5th edition, 2019

work page 2019

[7] [7]

Ergodic properties of markov processes

Martin Hairer. Ergodic properties of markov processes. Lecture notes, University of Warwick, 2006

work page 2006

[8] [8]

Birkhäuser, 2003

Onésimo Hernández-Lerma and Jean Bernard Lasserre.Markov chains and invariant probabilities, volume 211. Birkhäuser, 2003

work page 2003

[9] [9]

A natural policy gradient

Sham M Kakade. A natural policy gradient. InAdvances in neural information processing systems (NeurIPS), pages 1531–1538, 2001

work page 2001

[10] [10]

Koralov and Yakov G

Leonid B. Koralov and Yakov G. Sinai.Theory of Probability and Random Processes. Springer Berlin Heidelberg, 2nd edition, 2007

work page 2007

[11] [11]

Meyn and Richard L

Sean P. Meyn and Richard L. Tweedie.Markov Chains and Stochastic Stability. Cambridge Uni- versity Press, 2nd edition, 2009

work page 2009

[12] [12]

The geometry of dissipative evolution equations: the porous medium equation.Com- munications in Partial Differential Equations, 26(1-2):101–174, 2001

Felix Otto. The geometry of dissipative evolution equations: the porous medium equation.Com- munications in Partial Differential Equations, 26(1-2):101–174, 2001

work page 2001

[13] [13]

Learning to score behaviors for guided policy optimization.arXiv preprint arXiv:2006.00000, 2020

Aldo Pacchiano, Jack Parker-Holder, Yunhao Tang, Krzysztof Choromanski, Anna Choromanska, and Michael I Jordan. Learning to score behaviors for guided policy optimization.arXiv preprint arXiv:2006.00000, 2020

work page arXiv 2006

[14] [14]

David Pfau, Ian Davies, Diana Borsa, Joao G. M. Araujo, Brendan Tracey, and Hado van Hasselt. Wasserstein policy optimization. 2025

work page 2025

[15] [15]

Policy gradient in lipschitz markov decision processes

Matteo Pirotta, Marcello Restelli, and Luca Bascetta. Policy gradient in lipschitz markov decision processes. Machine Learning, 100(2-3):255–283, 2015

work page 2015

[16] [16]

Springer, 2015

Filippo Santambrogio.Optimal transport for applied mathematicians, volume 55. Springer, 2015

work page 2015

[17] [17]

Trust region policy optimization

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning (ICML), pages 1889–1897. PMLR, 2015

work page 2015

[18] [18]

Optimal Transport: Old and New, volume 338 ofGrundlehren der mathematischen Wissenschaften

Cédric Villani. Optimal Transport: Old and New, volume 338 ofGrundlehren der mathematischen Wissenschaften. Springer, Berlin, Heidelberg, 2009

work page 2009

[19] [19]

Policy optimization as wasserstein gradient flows

Ruiqi Zhang, Chen Chen, Chunyuan Li, and Lawrence Carin. Policy optimization as wasserstein gradient flows. In International Conference on Machine Learning (ICML), pages 12400–12410. PMLR, 2021

work page 2021

[20] [20]

Wasserstein gradient flows for optimizing gaussian mixture policies

Hanna Ziesche and Leonel Rozo. Wasserstein gradient flows for optimizing gaussian mixture policies. In Advances in Neural Information Processing Systems (NeurIPS), 2023. 35

work page 2023