
arXiv: 2604.14765 · v1 · submitted 2026-04-16 · cs.LG · math.OC · math.PR

Wasserstein Formulation of Reinforcement Learning. An Optimal Transport Perspective on Policy Optimization

Pith reviewed 2026-05-10 12:08 UTC · model grok-4.3

classification: cs.LG · math.OC · math.PR
keywords: reinforcement learning, Wasserstein space, optimal transport, policy optimization, gradient flow, Riemannian structure, Otto calculus, stationary distributions

The pith

Reinforcement learning policies are mapped into Wasserstein space so that policy optimization becomes a gradient flow with explicit second-order structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats a policy as a map from states to probability measures over actions and equips the resulting space with a Riemannian metric induced by the environment's stationary distributions. With this metric, Otto calculus yields a gradient flow for a general reinforcement learning objective. The gradient and Hessian of the associated energy functional are then derived, giving a formal second-order analysis that first-order methods do not provide. Numerical examples illustrate the method: in low dimensions the gradient is computed directly from the formalism, and in high dimensions the policy is parameterized by a neural network and optimized using an ergodic approximation of the cost.

Core claim

By viewing policies as maps into the Wasserstein space of action probabilities and inducing a Riemannian structure from stationary distributions, a general RL optimization problem admits a gradient flow whose direction and curvature are given explicitly by Otto calculus; the resulting gradient and Hessian provide the first- and second-order information needed to optimize policies in both low- and high-dimensional settings.

What carries the argument

The Riemannian structure that stationary distributions induce on the space of policies. It gives the tangent spaces of action-probability measures an inner product, so that policy space supports geodesics and gradient flows.
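
As a rough sketch of the kind of structure involved, the block below first recalls the standard Otto metric and Wasserstein gradient flow for a single probability measure, then writes one plausible stationary-weighted extension to policies. The weighted form is an assumption made for illustration, not the paper's definition, and the symbols S, A, \mu_\pi, v, w and E are hypothetical notation.

```latex
% Standard Otto calculus on probability measures \rho over the action space A.
% A tangent vector at \rho is identified with a gradient field \nabla\varphi via
%   \partial_t \rho = -\nabla\cdot(\rho\,\nabla\varphi).
\langle \nabla\varphi, \nabla\psi \rangle_{\rho}
  = \int_{A} \nabla\varphi(a)\cdot\nabla\psi(a)\,\rho(da)

% Standard Wasserstein gradient flow of an energy E(\rho):
\partial_t \rho_t
  = \nabla\cdot\!\Big(\rho_t\,\nabla\frac{\delta E}{\delta\rho}(\rho_t)\Big)

% ASSUMED form for a policy \pi : s \mapsto \pi(\cdot\mid s), with state-indexed
% tangent fields v(s,\cdot), w(s,\cdot) weighted by a stationary state
% distribution \mu_\pi (hypothetical notation, not taken from the paper):
\langle v, w \rangle_{\pi}
  = \int_{S}\int_{A} v(s,a)\cdot w(s,a)\,\pi(da\mid s)\,\mu_{\pi}(ds)
```

Under this assumed form, the pairing is only non-degenerate on states that carry stationary mass, which is why conditions on the stationary distribution matter for the construction.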

If this is right

  • Any RL objective that can be written as an energy on policy space now possesses a well-defined gradient flow.
  • The Hessian supplies curvature information that can be used for accelerated or Newton-style updates.
  • In low-dimensional problems the gradient can be computed directly from the formalism.
  • High-dimensional problems remain tractable by parameterizing the policy with a neural network and using an ergodic average of the cost.
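
The ergodic average mentioned in the last bullet can be illustrated on a toy problem. The sketch below is not the paper's experiment: it assumes a scalar linear system with Gaussian noise and a linear feedback gain `k` standing in for a neural-network policy, and it compares the time average of the cost along one long trajectory with the closed-form stationary average. All function names and parameter values are hypothetical.

```python
import random


def ergodic_average_cost(k, a=1.0, b=1.0, q=1.0, r=1.0, sigma=1.0,
                         steps=200_000, burn_in=1_000, seed=0):
    """Time-average the stage cost q*s^2 + r*u^2 along one simulated trajectory.

    Assumed toy dynamics: s' = a*s + b*u + sigma*noise, with policy u = -k*s.
    """
    rng = random.Random(seed)
    s = 0.0
    total = 0.0
    for t in range(burn_in + steps):
        u = -k * s
        if t >= burn_in:  # discard the transient before averaging
            total += q * s * s + r * u * u
        s = a * s + b * u + sigma * rng.gauss(0.0, 1.0)
    return total / steps


def exact_average_cost(k, a=1.0, b=1.0, q=1.0, r=1.0, sigma=1.0):
    """Closed-form average cost under the stationary Gaussian distribution."""
    c = a - b * k  # closed-loop coefficient
    if abs(c) >= 1.0:
        raise ValueError("closed loop is unstable; no stationary distribution")
    stationary_variance = sigma ** 2 / (1.0 - c * c)
    return (q + r * k * k) * stationary_variance


k = 0.5
estimate = ergodic_average_cost(k)
exact = exact_average_cost(k)
print(f"ergodic estimate: {estimate:.4f}, closed form: {exact:.4f}")
```

In a higher-dimensional setting the closed form is unavailable, so only the trajectory average would be used as the training signal; the comparison here is just a sanity check on the approximation.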

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same geometric construction may yield continuous-time limits of common discrete RL algorithms.
  • Convergence rates could be compared against standard policy-gradient methods on shared benchmark tasks.
  • Relaxing the stationary-distribution assumption might extend the framework to non-stationary or episodic settings.
  • Links to other optimal-transport applications in control could produce hybrid algorithms that move both states and actions in Wasserstein space.

Load-bearing premise

A Riemannian metric on the policy space can be defined from stationary distributions for arbitrary environments, and the vector fields that map states to tangent vectors remain measurable.

What would settle it

In a low-dimensional environment where the optimal policy is known analytically, compute the gradient flow trajectory directly from the derived formulas and check whether it converges to that known optimum.
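
A minimal version of such a check can be sketched on a scalar linear-quadratic problem, where the optimal gain is known from the discrete-time Riccati equation. This is a stand-in test under stated assumptions, not the paper's flow: plain gradient descent on a feedback gain `k` replaces the Wasserstein gradient flow, the policy is deterministic and linear, and all names and parameter values are hypothetical.

```python
def average_cost(k, a=1.0, b=1.0, q=1.0, r=1.0, sigma=1.0):
    """Stationary average cost of u = -k*s for s' = a*s + b*u + sigma*noise."""
    c = a - b * k  # closed-loop coefficient
    if abs(c) >= 1.0:
        raise ValueError("closed loop is unstable; no stationary distribution")
    return (q + r * k * k) * sigma ** 2 / (1.0 - c * c)


def riccati_gain(a=1.0, b=1.0, q=1.0, r=1.0, iters=500):
    """Known optimum: iterate the scalar discrete-time Riccati equation."""
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return a * b * p / (r + b * b * p)


def gradient_descent(k0=1.0, step=0.1, iters=500, h=1e-6):
    """Explicit-Euler descent on the gain, using a central finite difference."""
    k = k0
    for _ in range(iters):
        grad = (average_cost(k + h) - average_cost(k - h)) / (2.0 * h)
        k -= step * grad
    return k


k_star = riccati_gain()
k_found = gradient_descent()
print(f"Riccati gain: {k_star:.6f}, gradient descent: {k_found:.6f}")
```

With the default values a = b = q = r = 1, the Riccati gain is (√5 − 1)/2 ≈ 0.618, and the descent should settle on the same value. Replacing `gradient_descent` with a discretization of the paper's derived flow would turn this into the test described above.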

Figures

Figures reproduced from arXiv: 2604.14765 by Mathias Dus (IRMA).

Figure 1. Convergence of the Average Cost during Policy Iteration
Figure 3. Simulated trajectory. Top: State evolution with and without noise. Bottom: Control actions
Figure 4. Grid-based results. Left: Value Function heatmap. Right: Policy mean action heatmap.
Figure 5. Convergence of Average Cost (Grid)
Figure 7. Direct Diff. Physics: Trajectory and Control
Figure 10. Direct Diff. Physics: Trajectory and Control
Figure 14. Joint Training: Policy Loss (left) vs
Figure 15. Direct Differentiable Physics: State trajectories (top) and Control inputs (bottom).
Figure 16. Training Loss for Direct Differentiable Physics.
Figure 17. World Model Approach: State trajectories (left) and Control inputs (right).
Figure 18. Joint Training Losses: Policy Loss (left) and World Model Prediction Loss (right).

Original abstract

We present a geometric framework for Reinforcement Learning (RL) that views policies as maps into the Wasserstein space of action probabilities. First, we define a Riemannian structure induced by stationary distributions, proving its existence in a general context. We then define the tangent space of policies and characterize the geodesics, specifically addressing the measurability of vector fields mapped from the state space to the tangent space of probability measures over the action space. Next, we formulate a general RL optimization problem and construct a gradient flow using Otto's calculus. We compute the gradient and the Hessian of the energy, providing a formal second-order analysis. Finally, we illustrate the method with numerical examples for low-dimensional problems, computing the gradient directly from our theoretical formalism. For high-dimensional problems, we parameterize the policy using a neural network and optimize it based on an ergodic approximation of the cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper develops a geometric framework for reinforcement learning by viewing policies as maps from states to the Wasserstein space of action probability measures. It defines a Riemannian structure on the policy space induced by stationary distributions (claiming existence in general MDPs), characterizes the tangent space and geodesics while addressing measurability of state-to-tangent vector fields, formulates RL as energy minimization, constructs the gradient flow via Otto calculus, derives the gradient and Hessian of the energy for second-order analysis, and illustrates the method with direct gradient computation on low-dimensional problems and neural-network parameterization with ergodic approximation on high-dimensional ones.

Significance. If the core constructions are rigorous, the work offers a novel optimal-transport perspective on policy optimization that could unify geometric methods with RL, with the formal second-order (Hessian) analysis providing a clear strength for analyzing convergence. The use of Otto calculus for the gradient flow is a technically interesting contribution, though the numerical examples remain illustrative rather than comparative.

major comments (2)
  1. §2 (Riemannian structure induced by stationary distributions): The existence claim for the Riemannian metric in a general context is load-bearing for the entire gradient-flow construction, yet the manuscript supplies no explicit regularity conditions (e.g., uniform ergodicity, positivity of the stationary measure, or Lipschitz continuity of the transition kernel) that guarantee the induced inner product is well-defined and non-degenerate for arbitrary policies in continuous or unbounded spaces; without these, the Otto-calculus gradient and Hessian are undefined on the full policy space.
  2. §3 (tangent space and measurability of vector fields): The characterization of measurable vector fields mapping states to the tangent space of action-probability measures is essential for the gradient flow to be rigorously defined, but the provided arguments do not state sufficient conditions ensuring measurability is preserved under the stationary-distribution weighting; this directly affects whether the derived gradient and Hessian are valid operators.
minor comments (2)
  1. §5 (numerical examples): The low-dimensional illustrations compute the gradient from the formalism but report no quantitative metrics (convergence rates, regret, or baseline comparisons), weakening the empirical support for the theoretical claims.
  2. Abstract and §4: The high-dimensional case relies on an 'ergodic approximation' of the cost without specifying the approximation error or its effect on the Hessian analysis; a brief error bound would improve clarity.
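
The positivity concern in major comment 1 can be made concrete with a small finite example. The sketch below is an illustration under stated assumptions, not a construction from the paper: it uses a hypothetical three-state Markov chain with one transient state, and a per-state scalar `v[s]` that stands in for the size of a tangent field at state `s`. A nonzero field supported on the transient state then has zero stationary-weighted norm, so the weighted inner product is degenerate there.

```python
def stationary_distribution(P, iters=1000):
    """Power iteration mu <- mu P, starting from the uniform distribution."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu


def weighted_sq_norm(mu, v):
    """Stationary-weighted squared norm: sum over states of mu[s] * v[s]^2."""
    return sum(m * x * x for m, x in zip(mu, v))


# Hypothetical chain: state 0 is transient (it is never re-entered),
# so it carries no stationary mass.
P = [
    [0.0, 0.5, 0.5],
    [0.0, 0.5, 0.5],
    [0.0, 0.5, 0.5],
]

mu = stationary_distribution(P)
v = [1.0, 0.0, 0.0]  # nonzero field supported only on the transient state
norm_sq = weighted_sq_norm(mu, v)
print("stationary distribution:", mu)
print("weighted squared norm of a nonzero field:", norm_sq)
```

A positivity assumption on the stationary measure rules out this situation, which is one reason the referee asks for such conditions to be stated explicitly.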

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review of our manuscript. We address the two major comments point by point below, acknowledging where additional rigor is required and outlining the planned revisions.

Point-by-point responses
  1. Referee: §2 (Riemannian structure induced by stationary distributions): The existence claim for the Riemannian metric in a general context is load-bearing for the entire gradient-flow construction, yet the manuscript supplies no explicit regularity conditions (e.g., uniform ergodicity, positivity of the stationary measure, or Lipschitz continuity of the transition kernel) that guarantee the induced inner product is well-defined and non-degenerate for arbitrary policies in continuous or unbounded spaces; without these, the Otto-calculus gradient and Hessian are undefined on the full policy space.

    Authors: We agree that the existence of the Riemannian metric induced by stationary distributions requires explicit regularity conditions to be rigorously established for general MDPs, especially in continuous or unbounded spaces. Although the manuscript asserts a proof of existence in a general context, it does not enumerate the necessary assumptions (such as uniform ergodicity, positivity of the stationary measure, or Lipschitz continuity of the transition kernel) that ensure the inner product is well-defined and non-degenerate. In the revised version we will add a dedicated subsection in §2 that states these minimal conditions and shows how they guarantee the metric properties, thereby justifying the subsequent application of Otto calculus and the gradient/Hessian derivations.

    Revision: yes

  2. Referee: §3 (tangent space and measurability of vector fields): The characterization of measurable vector fields mapping states to the tangent space of action-probability measures is essential for the gradient flow to be rigorously defined, but the provided arguments do not state sufficient conditions ensuring measurability is preserved under the stationary-distribution weighting; this directly affects whether the derived gradient and Hessian are valid operators.

    Authors: We concur that the arguments concerning measurability of state-to-tangent vector fields need to be strengthened by explicitly stating sufficient conditions under which measurability is preserved when the fields are weighted by the stationary distribution. The manuscript addresses measurability in §3, yet the conditions (for instance, joint measurability with respect to the product sigma-algebra and positivity of the stationary measure) are not stated with sufficient clarity. We will revise the relevant paragraphs in §3 to include these conditions, ensuring that the tangent-space characterization, gradient flow, and the derived gradient and Hessian operators are rigorously valid.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: derivation builds on external Otto calculus and independent existence proof

The paper first proves existence of the Riemannian structure induced by stationary distributions in a general context, then applies Otto's calculus to define the gradient flow on the policy space. Gradient and Hessian computations follow directly from the energy functional and tangent-space characterization without any reduction to fitted parameters, self-definitions, or self-citation chains. The measurability of vector fields is addressed as part of the tangent-space construction rather than presupposed as an output. No step equates a derived quantity to its own input by construction, and the framework remains self-contained against the cited external optimal-transport machinery.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central constructions rest on the existence of a Riemannian structure induced by stationary distributions and on measurability of the lifted vector fields; no free parameters or new entities are introduced in the abstract.

axioms (2)
  • Domain assumption: existence of a Riemannian structure on the space of policies induced by stationary distributions.
    Stated as proved in a general context before defining tangent spaces and geodesics.
  • Domain assumption: measurability of vector fields from the state space to the tangent space of action probability measures.
    Explicitly addressed as a technical requirement for the geometric constructions.


