pith. sign in

arxiv: 2605.20718 · v1 · pith:JZONJ646new · submitted 2026-05-20 · 🧮 math.OC

Policy Gradient for Continuous-Time Mean-Field Control

Pith reviewed 2026-05-21 04:11 UTC · model grok-4.3

classification 🧮 math.OC
keywords mean-field controlpolicy gradientGâteaux derivativeHamilton-Jacobi-Bellman equationactor-criticMcKean-Vlasoventropy regularizationrandomized feedback policy
0
0 comments X

The pith

Policy gradients for continuous-time mean-field control follow directly from an explicit Gâteaux formula in the value function and instantaneous advantage function.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a policy gradient method for entropy-regularized mean-field control in the discounted infinite-horizon setting. It considers randomized feedback policies acting on a representative particle whose state evolves jointly with a population law that satisfies a McKean-Vlasov equation, so the value function is defined on the product space of state and probability measures. After the value function is obtained by solving the linear stationary Hamilton-Jacobi-Bellman equation, the policy gradient is recovered from a first-order Gâteaux variation formula that uses only the value function and an instantaneous advantage function measuring the gain of a chosen action relative to the current policy. The resulting actor-critic procedure updates the policy parameters with this explicit direction, after the critic step represents the population dependence via cylindrical functions.

Core claim

After computing the value function under a fixed randomized feedback policy, the policy gradient is obtained directly via an explicit Gâteaux formula in terms of the value function and the instantaneous advantage function, without solving an additional equation.

What carries the argument

The Gâteaux policy-gradient formula expressed through the instantaneous advantage function that quantifies the gain of taking a given action relative to the current randomized policy.

If this is right

  • The method produces a model-based actor-critic scheme in which the critic solves the linear stationary HJB equation and the actor updates via the derived explicit gradient formula.
  • Well-posedness of the underlying PDE is established in the chosen polynomial-growth function class with cylindrical dependence on the population law.
  • The approach is illustrated by numerical experiments on an LQR model and a crowd-motion problem.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The single-PDE-per-iteration structure could reduce the number of coupled solves required in high-dimensional or many-agent control settings compared with methods that recompute a gradient-specific equation.
  • The cylindrical-function representation of measure dependence may transfer to other control problems where the state law enters the dynamics or cost.
  • The explicit advantage-based variation formula invites direct comparison with discrete-time mean-field policy gradients obtained by taking continuous-time limits.

Load-bearing premise

The value function belongs to a suitable polynomial-growth function class in which the linear stationary HJB equation is well-posed when the population law is represented via cylindrical functions.

What would settle it

A direct numerical check in which a small policy perturbation produces an objective change that fails to match the directional derivative predicted by the Gâteaux formula applied to the computed value function.

Figures

Figures reproduced from arXiv: 2605.20718 by Erhan Bayraktar, Martin Hernandez, Qinxin Yan, Yuhua Zhu.

Figure 1
Figure 1. Figure 1: Comparison between the learned and optimal solutions for different values of the population mean ¯µ. Top row: policy mean induced by the learned actor and optimal Riccati policy mean. Bottom row: value function associated with the learned policy and optimal Riccati value function. 0 200 400 600 800 1000 Iteration 1.2 1.0 0.8 0.6 0.4 0.2 0.0 Parameter value 0 1 2 * (Riccati) 0 200 400 600 800 1000 Iteration… view at source ↗
Figure 2
Figure 2. Figure 2: Convergence of the finite-dimensional parameters in the scalar mean-field LQR example. Left: Galerkin critic coefficients θk compared with the Riccati coefficients θ ∗ . Right: actor coefficients ωk compared with the optimal Riccati feedback coefficients ω ∗ . and Gaussian for each fixed actor parameter, these trajectories are sampled from the exact transition law, while the population mean is propagated t… view at source ↗
Figure 3
Figure 3. Figure 3: Convergence of the finite-dimensional parameters in the scalar mean-field LQR example with undiscounted measure m. Left: Galerkin critic coefficients θk com￾pared with the Riccati coefficients θ ∗ . Right: actor coefficients ωk compared with the optimal Riccati feedback coefficients ω ∗ . pass through regions of high crowd density. The crowd is described by a population law µt, which is approximated numeri… view at source ↗
Figure 4
Figure 4. Figure 4: Training diagnostics for the crowd-aversion transport example. Left: esti￾mated entropy-regularized reward along the actor–critic iterations. Right: norms of the actor parameters, critic coefficients, and policy-gradient direction. 0 100 200 300 400 500 Iteration k 30000 25000 20000 15000 10000 5000 0 estimated regularized reward running mean 0 100 200 300 400 500 Iteration k 10 2 10 1 10 0 10 1 10 2 10 3 … view at source ↗
Figure 5
Figure 5. Figure 5: Training diagnostics for the crowd-aversion transport example using the undiscounted occupancy measure. Left: estimated entropy-regularized reward along the actor–critic iterations. Right: norms of the actor parameters, critic coefficients, and policy-gradient direction. 2 1 0 1 2 x1 0.6 0.4 0.2 0.0 0.2 0.4 x2 tagged trajectories tagged mean path population mean path trajectory envelope initial tagged stat… view at source ↗
Figure 6
Figure 6. Figure 6: Simulated trajectories after training. The plot shows representative trajec￾tories under the learned policy, together with the empirical population mean path and the target location. 8. Acknowledgements E. Bayraktar is supported in part by the NSF grants DMS-2507940, and 2406232, and in part by the Susan M. Smith Professorship. Y. Zhu and M. HERNANDEZ is supported in part by the NSF grants No 2529107. Appe… view at source ↗
read the original abstract

This paper develops a policy gradient method for entropy-regularized mean-field control in the discounted infinite-horizon setting. We consider randomized feedback policies and a coupled representative-particle/population system, in which the representative state evolves jointly with a population law governed by a McKean--Vlasov equation. The resulting value function is therefore defined on the product space $\mathbb R^d \times \mathcal P_2(\mathbb R^d)$. A key distinction from existing policy gradient methods for mean-field control is that, after computing the value function under a fixed policy, our approach does not require solving an additional equation to obtain the policy gradient. Instead, we derive an explicit policy gradient formula directly in terms of the value function. The formulation is based on an instantaneous advantage function, which quantifies the gain of taking a given action relative to the current randomized policy. We establish a G\^ateaux policy-gradient formula, which gives the first-order variation of the objective along arbitrary policy perturbations, and then derive the corresponding ascent direction under finite-dimensional policy parametrization. The resulting formula leads to a model-based actor--critic scheme. The critic is obtained by solving the associated linear stationary Hamilton--Jacobi--Bellman equation for the value function, using cylindrical functions to represent dependence on the population law. The actor is then updated according to the derived policy-gradient formula. We further analyze the well-posedness of the PDE in a polynomial-growth function class. Finally, we illustrate the proposed method through numerical experiments on an LQR model and a crowd-motion problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper develops a policy gradient method for entropy-regularized mean-field control in the continuous-time, discounted infinite-horizon setting. It considers randomized feedback policies acting on a coupled representative-particle and population system whose law evolves according to a McKean-Vlasov equation. The central contribution is an explicit Gâteaux formula expressing the first-order variation of the objective directly in terms of the value function and an instantaneous advantage function, obtained after solving only the linear stationary HJB equation; this yields a model-based actor-critic scheme in which the critic uses cylindrical functions to represent measure dependence and the actor is updated via the derived ascent direction. Well-posedness of the PDE is established in a polynomial-growth function class, and the method is illustrated numerically on an LQR example and a crowd-motion problem.

Significance. If the Gâteaux formula is valid, the approach offers a genuine simplification by eliminating the need for an auxiliary equation to compute the policy gradient, which is a practical advantage in mean-field settings. The combination of the explicit formula, cylindrical-function representation, and well-posedness result in a polynomial-growth class supplies a coherent theoretical framework. The numerical experiments on standard benchmark problems provide concrete evidence of implementability, although their scope remains limited to linear-quadratic and simple interaction models.

minor comments (3)
  1. [Abstract] The abstract states that the value function is defined on R^d × P_2(R^d) but does not indicate whether the cylindrical-function representation is introduced before or after the Gâteaux derivation; a brief forward reference would improve readability.
  2. [Well-posedness analysis] In the well-posedness analysis, the precise polynomial-growth exponents and the constants appearing in the a-priori estimates are not restated in the main theorem statement; adding them would make the result self-contained.
  3. [Numerical experiments] The numerical section reports results for the LQR and crowd-motion examples but does not specify the discretization parameters (time step, number of particles, or basis size for the cylindrical functions); these details are needed for reproducibility.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful and accurate summary of our work, as well as the positive assessment of its significance. The recommendation for minor revision is appreciated. We note that the major comments section of the report is empty, so we have no specific points requiring point-by-point rebuttal at this stage. We are happy to implement any minor clarifications or improvements the editor or referee may suggest in the next version.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The derivation computes the value function by solving the linear stationary HJB equation under a fixed randomized feedback policy, then obtains the policy gradient via an explicit Gâteaux formula expressed directly in terms of that value function and the instantaneous advantage function. The HJB residual is used to absorb future costs and measure-coupling effects, so the gradient expression does not reduce to a fitted parameter, a renamed input, or a self-citation chain. The central claim therefore retains independent mathematical content and is self-contained against the stated well-posedness assumptions on polynomial-growth cylindrical functions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard existence and uniqueness results for McKean-Vlasov equations and on the well-posedness of the linear stationary HJB in a polynomial-growth class; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption The McKean-Vlasov equation governing the population law admits a unique solution for the chosen randomized feedback policies.
    Invoked when defining the coupled representative-particle/population system.
  • domain assumption The value function lies in a polynomial-growth function class where the linear stationary HJB equation is well-posed under cylindrical representations of the population measure.
    Stated in the well-posedness analysis paragraph.

pith-pipeline@v0.9.0 · 5820 in / 1572 out tokens · 32892 ms · 2026-05-21T04:11:54.306972+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages

  1. [1]

    Mean field type control with congestion.Applied Mathematics & Optimization, 73(3):393–418, 2016

    Yves Achdou and Mathieu Lauri` ere. Mean field type control with congestion.Applied Mathematics & Optimization, 73(3):393–418, 2016

  2. [2]

    Mean field type control with congestion II: An augmented lagrangian method

    Yves Achdou and Mathieu Lauri` ere. Mean field type control with congestion II: An augmented lagrangian method. Applied Mathematics & Optimization, 74(3):535–578, 2016

  3. [3]

    A maximum principle for SDEs of mean-field type.Applied Mathematics and Optimization, 63(3):341–356, 2011

    Daniel Andersson and Boualem Djehiche. A maximum principle for SDEs of mean-field type.Applied Mathematics and Optimization, 63(3):341–356, 2011

  4. [4]

    Alex Ayoub, Zeyu Jia, Csaba Szepesv´ ari, Mengdi Wang, and Lin F. Yang. Model-based reinforcement learning with value-targeted regression. InProceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 463–474. PMLR, 2020

  5. [5]

    Leemon C. Baird. Reinforcement learning in continuous time: advantage updating.Proceedings of 1994 IEEE Inter- national Conference on Neural Networks (ICNN’94), 4:2448–2453 vol.4, 1994

  6. [6]

    A weak martingale approach to linear-quadratic McKean–Vlasov stochastic control problems.Journal of Optimization Theory and Applications, 181(2):347–382, 2019

    Matteo Basei and Huyˆ en Pham. A weak martingale approach to linear-quadratic McKean–Vlasov stochastic control problems.Journal of Optimization Theory and Applications, 181(2):347–382, 2019

  7. [7]

    Viscosity solutions of fully second-order hjb equations in the wasserstein space.SIAM Journal on Control and Optimization, 2026

    Erhan Bayraktar, Hang Cheung, Ibrahim Ekren, Jinniao Qiu, Ho Man Tai, and Xin Zhang. Viscosity solutions of fully second-order hjb equations in the wasserstein space.SIAM Journal on Control and Optimization, 2026. To appear

  8. [8]

    Erhan Bayraktar, Andrea Cosso, and Huyˆ en Pham. Randomized dynamic programming principle and feynman–kac representation for optimal control of mckean–vlasov dynamics.Transactions of the American Mathematical Society, 370(3):2115–2165, 2018

  9. [9]

    Comparison for semi-continuous viscosity solutions for second order pdes on the wasserstein space.Journal of Differential Equations, 455:1–31, 2026

    Erhan Bayraktar, Ibrahim Ekren, Xihao He, and Xin Zhang. Comparison for semi-continuous viscosity solutions for second order pdes on the wasserstein space.Journal of Differential Equations, 455:1–31, 2026

  10. [10]

    Comparison of viscosity solutions for a class of second-order pdes on the wasserstein space.Communications in Partial Differential Equations, 50(4):570–613, 2025

    Erhan Bayraktar, Ibrahim Ekren, and Xin Zhang. Comparison of viscosity solutions for a class of second-order pdes on the wasserstein space.Communications in Partial Differential Equations, 50(4):570–613, 2025

  11. [11]

    Convergence rate of particle system for second-order pdes on wasserstein space.SIAM Journal on Control and Optimization, 63(3):1515–1782, 2025

    Erhan Bayraktar, Ibrahim Ekren, and Xin Zhang. Convergence rate of particle system for second-order pdes on wasserstein space.SIAM Journal on Control and Optimization, 63(3):1515–1782, 2025

  12. [12]

    Mean-field phibe: Continuous-time mean-field reinforcement learning from discrete-time data.Preprint, 2026

    Erhan Bayraktar, Martin Hernandez, Qinxin Yan, and Yuhua Zhu. Mean-field phibe: Continuous-time mean-field reinforcement learning from discrete-time data.Preprint, 2026. arXiv preprint

  13. [13]

    Ergodicity and turnpike properties of linear-quadratic mean field control problems

    Erhan Bayraktar and Jiamin Jian. Ergodicity and turnpike properties of linear-quadratic mean field control problems. arXiv preprint, 2024

  14. [14]

    Convergence and turnpike properties of linear-quadratic mean field control problems with common noise.arXiv preprint arXiv, 2025

    Erhan Bayraktar and Jiamin Jian. Convergence and turnpike properties of linear-quadratic mean field control problems with common noise.arXiv preprint arXiv, 2025

  15. [15]

    Learning with linear function approximations in mean-field control.Journal of Machine Learning Research, 26(192):1–53, 2025

    Erhan Bayraktar and Ali Devran Kara. Learning with linear function approximations in mean-field control.Journal of Machine Learning Research, 26(192):1–53, 2025

  16. [16]

    Propagation of chaos of forward-backward stochastic differential equa- tions with graphon interactions.Applied Mathematics and Optimization, 88(1):Article 25, 2023

    Erhan Bayraktar, Ruoyu Wu, and Xin Zhang. Propagation of chaos of forward-backward stochastic differential equa- tions with graphon interactions.Applied Mathematics and Optimization, 88(1):Article 25, 2023

  17. [17]

    SpringerBriefs in Mathematics

    Alain Bensoussan, Jens Frehse, and Phillip Yam.Mean field games and mean field type control theory. SpringerBriefs in Mathematics. Springer, New York, 2013. 36

  18. [18]

    The pontryagin maximum principle in the wasserstein space.Calculus of Varia- tions and Partial Differential Equations, 58:1–36, 2017

    Benoˆ ıt Bonnet and Francesco Rossi. The pontryagin maximum principle in the wasserstein space.Calculus of Varia- tions and Partial Differential Equations, 58:1–36, 2017

  19. [19]

    A general maximum principle for SDEs of mean-field type.Applied Mathematics and Optimization, 64(2):197–216, 2011

    Rainer Buckdahn, Boualem Djehiche, and Juan Li. A general maximum principle for SDEs of mean-field type.Applied Mathematics and Optimization, 64(2):197–216, 2011

  20. [20]

    Mean-field stochastic differential equations and associated pdes.Annals of Probability, 45(2):824–878, 2017

    Rainer Buckdahn, Juan Li, Shige Peng, and Chao Rainer. Mean-field stochastic differential equations and associated pdes.Annals of Probability, 45(2):824–878, 2017

  21. [21]

    Max Reppen, and H

    Matteo Burzoni, Vincenzo Ignazio, A. Max Reppen, and H. Mete Soner. Viscosity solutions for controlled McKean– Vlasov jump-diffusions.SIAM Journal on Control and Optimization, 58(3):1676–1699, 2020

  22. [22]

    Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2016

    Ren´ e Carmona.Lectures on BSDEs, stochastic control, and stochastic differential games with financial applications, volume 1 ofFinancial Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2016

  23. [23]

    Forward–backward stochastic differential equations and controlled McKean– Vlasov dynamics.Annals of Probability, 43(5):2647–2700, 2015

    Ren´ e Carmona and Fran¸ cois Delarue. Forward–backward stochastic differential equations and controlled McKean– Vlasov dynamics.Annals of Probability, 43(5):2647–2700, 2015

  24. [24]

    I, volume 83 of Probability Theory and Stochastic Modelling

    Ren´ e Carmona and Fran¸ cois Delarue.Probabilistic theory of mean field games with applications. I, volume 83 of Probability Theory and Stochastic Modelling. Springer, Cham, 2018. Mean field FBSDEs, control, and games

  25. [25]

    II, volume 84 of Probability Theory and Stochastic Modelling

    Ren´ e Carmona and Fran¸ cois Delarue.Probabilistic theory of mean field games with applications. II, volume 84 of Probability Theory and Stochastic Modelling. Springer, Cham, 2018. Mean field games with common noise and master equations

  26. [26]

    Control of McKean–Vlasov dynamics versus mean field games

    Ren´ e Carmona, Fran¸ cois Delarue, and Aim´ e Lachapelle. Control of McKean–Vlasov dynamics versus mean field games. Mathematics and Financial Economics, 7(2):131–166, 2013

  27. [27]

    Mean field games and systemic risk.Communications in Mathematical Sciences, 13(4):911–933, 2015

    Ren´ e Carmona, Jean-Pierre Fouque, and Li-Hsien Sun. Mean field games and systemic risk.Communications in Mathematical Sciences, 13(4):911–933, 2015

  28. [28]

    Ren´ e Carmona and Mathieu Lauri` ere. Convergence analysis of machine learning algorithms for the numerical solution of mean field control and games: II — the finite horizon case.Annals of Applied Probability, 32(6):4065–4105, 2022

  29. [29]

    Linear-quadratic mean-field reinforcement learning: Convergence of policy gradient methods.Preprint arXiv:1910.04295, 2019

    Ren´ e Carmona, Mathieu Lauri` ere, and Zongjun Tan. Linear-quadratic mean-field reinforcement learning: Convergence of policy gradient methods.Preprint arXiv:1910.04295, 2019

  30. [30]

    Carrillo, Young-Pil Choi, and Maxime Hauray

    Jos´ e A. Carrillo, Young-Pil Choi, and Maxime Hauray. The derivation of swarming models: Mean-field limit and Wasserstein distances. InCollective Dynamics from Bacteria to Crowds, CISM International Centre for Mechanical Sciences, pages 1–46. Springer, 2014

  31. [31]

    Carrillo, Massimo Fornasier, Giuseppe Toscani, and Francesco Vecil

    Jos´ e A. Carrillo, Massimo Fornasier, Giuseppe Toscani, and Francesco Vecil. Particle, kinetic, and hydrodynamic models of swarming. InMathematical Modeling of Collective Behavior in Socio-Economic and Life Sciences, Modeling and Simulation in Science, Engineering and Technology, pages 297–336. Birkh¨ auser Boston, 2010

  32. [32]

    Propagation of chaos: A review of models, methods and applications

    Louis-Pierre Chaintron and Antoine Diez. Propagation of chaos: A review of models, methods and applications. I. models and methods.Kinetic and Related Models, 15(6):895–1015, 2022

  33. [33]

    Numerical method for FBSDEs of McKean–Vlasov type.Annals of Applied Probability, 29(3):1640–1684, 2019

    Jean-Fran¸ cois Chassagneux, Dan Crisan, and Fran¸ cois Delarue. Numerical method for FBSDEs of McKean–Vlasov type.Annals of Applied Probability, 29(3):1640–1684, 2019

  34. [34]

    Emergent behavior in flocks.IEEE Transactions on Automatic Control, 52(5):852–862, 2007

    Felipe Cucker and Steve Smale. Emergent behavior in flocks.IEEE Transactions on Automatic Control, 52(5):852–862, 2007

  35. [35]

    Mckean–vlasov optimal control: Limit theory and equivalence between different formulations.Mathematics of Operations Research, 47(4):2891–2930, 2022

    Mao Fabrice Djete, Dylan Possama¨ ı, and Xiaolu Tan. Mckean–vlasov optimal control: Limit theory and equivalence between different formulations.Mathematics of Operations Research, 47(4):2891–2930, 2022

  36. [36]

    Martingale measures and stochastic calculus.Probability Theory and Related Fields, 84(1–2):83–101, 1990

    Nicole El Karoui and Sylvie M´ el´ eard. Martingale measures and stochastic calculus.Probability Theory and Related Fields, 84(1–2):83–101, 1990

  37. [37]

    Actor-critic learning for mean-field control in continuous time.J

    Noufel Frikha, Maximilien Germain, Mathieu Lauriere, Huyˆ en Pham, and Xuanye Song. Actor-critic learning for mean-field control in continuous time.J. Mach. Learn. Res., 26:Paper No. [127], 42, 2025

  38. [38]

    Noufel Frikha, Huyˆ en Pham, and Xuanye Song. Full error analysis of policy gradient learning algorithms for exploratory linear quadratic mean-field control problem in continuous time with common noise.Preprint arXiv:2408.02489, 2024

  39. [39]

    Hamilton-Jacobi equations in the Wasserstein space.Methods Appl

    Wilfrid Gangbo, Truyen Nguyen, and Adrian Tudorascu. Hamilton-Jacobi equations in the Wasserstein space.Methods Appl. Anal., 15(2):155–183, 2008

  40. [40]

    Opinion dynamics and bounded confidence: Models, analysis and simulation

    Rainer Hegselmann and Ulrich Krause. Opinion dynamics and bounded confidence: Models, analysis and simulation. Journal of Artificial Societies and Social Simulation, 5(3), 2002

  41. [41]

    Howard.Dynamic programming and Markov processes

    Ronald A. Howard.Dynamic programming and Markov processes. Technology Press of M.I.T., Cambridge, MA; John Wiley & Sons, Inc., New York-London, 1960

  42. [42]

    A linear-quadratic optimal control problem for mean-field stochastic differential equations in infinite horizon.Math

    Jianhui Huang, Xun Li, and Jiongmin Yong. A linear-quadratic optimal control problem for mean-field stochastic differential equations in infinite horizon.Math. Control Relat. Fields, 5(1):97–139, 2015

  43. [43]

    Infinite horizon value functions in the Wasserstein spaces.J

    Ryan Hynd and Hwa Kil Kim. Infinite horizon value functions in the Wasserstein spaces.J. Differential Equations, 258(6):1933–1966, 2015

  44. [44]

    Accuracy of discretely sampled stochastic policies in continuous-time reinforcement learning

    Yanwei Jia, Du Ouyang, and Yufei Zhang. Accuracy of discretely sampled stochastic policies in continuous-time reinforcement learning. Preprint arXiv:2503.09981, 2025

  45. [45]

    Policy evaluation and temporal-difference learning in continuous time and space: A martingale approach.Journal of Machine Learning Research, 23(1):6918–6972, 2022

    Yanwei Jia and Xun Yu Zhou. Policy evaluation and temporal-difference learning in continuous time and space: A martingale approach.Journal of Machine Learning Research, 23(1):6918–6972, 2022

  46. [46]

    Policy gradient and actor-critic learning in continuous time and space.Journal of Machine Learning Research, 23(1):12603–12652, 2022

    Yanwei Jia and Xun Yu Zhou. Policy gradient and actor-critic learning in continuous time and space.Journal of Machine Learning Research, 23(1):12603–12652, 2022

  47. [47]

    Yanwei Jia and Xun Yu Zhou.q-learning in continuous time.Journal of Machine Learning Research, 24(161):1–61, 2023

  48. [48]

    Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I. Jordan. Provably efficient reinforcement learning with linear function approximation. InProceedings of the Thirty Third Conference on Learning Theory, volume 125 ofProceedings of Machine Learning Research, pages 2137–2143. PMLR, 2020

  49. [49]

    Limit theory for controlled McKean–Vlasov dynamics.SIAM Journal on Control and Optimization, 55(3):1641–1672, 2017

    Daniel Lacker. Limit theory for controlled McKean–Vlasov dynamics.SIAM Journal on Control and Optimization, 55(3):1641–1672, 2017. 37

  50. [50]

    Mean field games.Japanese Journal of Mathematics, 2(1):229–260, 2007

    Jean-Michel Lasry and Pierre-Louis Lions. Mean field games.Japanese Journal of Mathematics, 2(1):229–260, 2007

  51. [51]

    Numerical methods for mean field games and mean field type control.Proceedings of Symposia in Applied Mathematics, 78:221–282, 2021

    Mathieu Lauri` ere. Numerical methods for mean field games and mean field type control.Proceedings of Symposia in Applied Mathematics, 78:221–282, 2021

  52. [52]

    Dynamic programming for mean-field type control.Comptes Rendus Math´ ematique

    Mathieu Lauri` ere and Olivier Pironneau. Dynamic programming for mean-field type control.Comptes Rendus Math´ ematique. Acad´ emie des Sciences. Paris, 352(9):707–713, 2014

  53. [53]

    Mean-field stochastic linear quadratic optimal control problems: Closed-loop solvability.Probability, Uncertainty and Quantitative Risk, 1(1):2, 2016

    Xun Li, Jingrui Sun, and Jiongmin Yong. Mean-field stochastic linear quadratic optimal control problems: Closed-loop solvability.Probability, Uncertainty and Quantitative Risk, 1(1):2, 2016

  54. [54]

    Cours au Coll` ege de France: Th´ eorie des jeux ` a champ moyen, 2012

    Pierre-Louis Lions. Cours au Coll` ege de France: Th´ eorie des jeux ` a champ moyen, 2012. Audio lectures, 2006–2012

  55. [55]

    Linear quadratic optimal control of conditional McKean–Vlasov equation with random coefficients and applications.Journal of Mathematical Economics, 66:7–26, 2016

    Huyˆ en Pham. Linear quadratic optimal control of conditional McKean–Vlasov equation with random coefficients and applications.Journal of Mathematical Economics, 66:7–26, 2016

  56. [56]

    Actor-critic learning algorithms for mean-field control with moment neural networks

    Huyˆ en Pham and Xavier Warin. Actor-critic learning algorithms for mean-field control with moment neural networks. Methodology and Computing in Applied Probability, 27(13), 2025

  57. [57]

    Dynamic programming for optimal control of stochastic McKean–Vlasov dynamics

    Huyˆ en Pham and Xiaoli Wei. Dynamic programming for optimal control of stochastic McKean–Vlasov dynamics. SIAM Journal on Control and Optimization, 55(2):1069–1101, 2017

  58. [58]

    Bellman equation and viscosity solutions for mean-field stochastic control problem

    Huyˆ en Pham and Xiaoli Wei. Bellman equation and viscosity solutions for mean-field stochastic control problem. ESAIM Control Optim. Calc. Var., 24(1):437–461, 2018

  59. [59]

    Mean-field neural networks: Learning mappings on wasserstein space.Neural Net- works, 168:380–393, 2023

    Huyˆ en Pham and Xavier Warin. Mean-field neural networks: Learning mappings on wasserstein space.Neural Net- works, 168:380–393, 2023

  60. [60]

    Continuous-time q-learning for mean-field control with common noise, part-i: Theoretical foundations, 2026

    Zhenjie Ren, Xiaoli Wei, Xiang Yu, and Xun Yu Zhou. Continuous-time q-learning for mean-field control with common noise, part-i: Theoretical foundations, 2026

  61. [61]

    Continuous-time q-learning for mean-field control with common noise, part-ii: q-learning algorithms, 2026

    Zhenjie Ren, Xiaoli Wei, Xiang Yu, and Xun Yu Zhou. Continuous-time q-learning for mean-field control with common noise, part-ii: q-learning algorithms, 2026

  62. [62]

    Osher, Wuchen Li, Levon Nurbekyan, and Samy Wu Fung

    Lars Ruthotto, Stanley J. Osher, Wuchen Li, Levon Nurbekyan, and Samy Wu Fung. A machine learning framework for solving high-dimensional mean field game and mean field control problems.Proceedings of the National Academy of Sciences, 117(17):9183–9193, 2020

  63. [63]

    Mete Soner, Josef Teichmann, and Qinxin Yan

    H. Mete Soner, Josef Teichmann, and Qinxin Yan. Learning algorithms for mean field optimal control.Preprint arXiv:2503.17869, 2025

  64. [64]

    Mete Soner and Qinxin Yan

    H. Mete Soner and Qinxin Yan. Viscosity solutions for McKean–Vlasov control on a torus.SIAM Journal on Control and Optimization, 62(2):903–923, 2024

  65. [65]

    Mete Soner and Qinxin Yan

    H. Mete Soner and Qinxin Yan. Viscosity solutions of the eikonal equation on the wasserstein space.Applied Mathe- matics & Optimization, 90(1), 2024

  66. [66]

    Mean-field stochastic linear quadratic optimal control problems: Open-loop solvabilities.ESAIM: Control, Optimisation and Calculus of Variations, 23(3):1099–1127, 2017

    Jingrui Sun. Mean-field stochastic linear quadratic optimal control problems: Open-loop solvabilities.ESAIM: Control, Optimisation and Calculus of Variations, 23(3):1099–1127, 2017

  67. [67]

    Mean-field stochastic linear-quadratic optimal control problems: Weak closed-loop solvability.Mathematical Control and Related Fields, 11(1):47–71, 2021

    Jingrui Sun and Hanxiao Wang. Mean-field stochastic linear-quadratic optimal control problems: Weak closed-loop solvability.Mathematical Control and Related Fields, 11(1):47–71, 2021

  68. [68]

    Springer Nature, 2020

    Jingrui Sun and Jiongmin Yong.Stochastic linear-quadratic optimal control theory: differential games and mean-field problems. Springer Nature, 2020

  69. [69]

    The exact law of large numbers via fubini extension and characterization of insurable risks.Journal of Economic Theory, 126(1):31–69, 2006

    Yeneng Sun. The exact law of large numbers via fubini extension and characterization of insurable risks.Journal of Economic Theory, 126(1):31–69, 2006

  70. [70]

    Sutton and Andrew G

    Richard S. Sutton and Andrew G. Barto.Reinforcement learning: an introduction. Adaptive Computation and Ma- chine Learning. MIT Press, Cambridge, MA, second edition, 2018

  71. [71]

    Synthesis Lectures on Artificial Intelligence and Machine Learning

    Csaba Szepesv´ ari.Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool, 2010

  72. [72]

    Topics in propagation of chaos

    Alain-Sol Sznitman. Topics in propagation of chaos. In ´Ecole d’ ´Et´ e de Probabilit´ es de Saint-Flour XIX—1989, volume 1464 ofLecture Notes in Mathematics, pages 165–251. Springer, Berlin, Heidelberg, 1991

  73. [73]

    Optimal scheduling of entropy regularizer for continuous- time linear-quadratic reinforcement learning.SIAM Journal on Control and Optimization, 62(1):135–166, 2024

    Lukasz Szpruch, Tanut Treetanthiploet, and Yufei Zhang. Optimal scheduling of entropy regularizer for continuous- time linear-quadratic reinforcement learning.SIAM Journal on Control and Optimization, 62(1):135–166, 2024

  74. [74]

    Making deep Q-learning methods robust to time discretization

    Corentin Tallec, L´ eonard Blier, and Yann Ollivier. Making deep Q-learning methods robust to time discretization. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97 ofPMLR, pages 6096– 6104, 2019

  75. [75]

    Kinetic models of opinion formation.Communications in Mathematical Sciences, 4(3):481–496, 2006

    Giuseppe Toscani. Kinetic models of opinion formation.Communications in Mathematical Sciences, 4(3):481–496, 2006

  76. [76]

    Reinforcement learning in continuous time and space: A stochastic control approach.Journal of Machine Learning Research, 21(198):1–34, 2020

    Haoran Wang, Thaleia Zariphopoulou, and Xun Yu Zhou. Reinforcement learning in continuous time and space: A stochastic control approach.Journal of Machine Learning Research, 21(198):1–34, 2020

  77. [77]

    Global convergence of policy gradient for linear- quadratic mean-field control/game in continuous time

    Weichen Wang, Jiequn Han, Zhuoran Yang, and Zhaoran Wang. Global convergence of policy gradient for linear- quadratic mean-field control/game in continuous time. InProceedings of the 38th International Conference on Machine Learning (ICML), volume 139 ofPMLR, pages 10772–10782, 2021

  78. [78]

    Continuous time q-learning for mean-field control problems.Applied Mathematics & Opti- mization, 91(1):10, 2025

    Xiaoli Wei and Xiang Yu. Continuous time q-learning for mean-field control problems.Applied Mathematics & Opti- mization, 91(1):10, 2025

  79. [79]

    Unified continuous-time q-learning for mean-field game and mean-field control problems.ArXiv, abs/2407.04521, 2024

    Xiaoli Wei, Xiang Yu, and Fengyi Yuan. Unified continuous-time q-learning for mean-field game and mean-field control problems.ArXiv, abs/2407.04521, 2024

  80. [80]

    Efficient local planning with linear function approximation

    Dong Yin, Botao Hao, Yasin Abbasi-Yadkori, Nevena Lazi´ c, and Csaba Szepesv´ ari. Efficient local planning with linear function approximation. InProceedings of the 33rd International Conference on Algorithmic Learning Theory, volume 167 ofProceedings of Machine Learning Research, pages 1165–1192. PMLR, 2022

Showing first 80 references.